Hi everyone,

It's been a while since I started working on a mechanism for moving all the memory of a domain from one NUMA node to another. Yes, that is part of the more general work on improving Xen NUMA support that I'm carrying on; for more details, see here: http://wiki.xen.org/wiki/Xen_NUMA_Roadmap.

The approach I decided to take is to mimic a sort of back-to-back save/restore. In some more detail: I suspend the domain, deallocate all its memory, reallocate it in different places (for instance, if the NUMA node-affinity changed in the meanwhile), update the domain's and Xen's address translation tables, and resume the domain. Easy, eh? :-D

All the above happens at the libxc level, although the patch series provides all the glue code needed to interact with the new feature from both libxl and xl.

Also, note that this first series focuses on PV guests. For HVM, some "if (hvm)" here and there will do the trick, together, of course, with the proper updating of HAP tables, etc. Still much less work than all the tweaking required by PV guests! I'll include more HVM bits in future releases of this series but, in case you have comments on that aspect too, do feel free to provide them right now.

I got sidetracked and distracted many times and, I have to admit, this is not quite a done job yet. However, I have reached the point where at least part of what I have can be shown, so that you can provide some early feedback and help me proceed further with future design choices and implementation steps. I find it quite challenging as, especially for PV, it touches and exercises a lot of code paths and features I'm not yet very familiar with. That is why feedback is really important, even if the thing is still at an early stage.
For instance, discussing how to properly deal with things like grant tables or TMEM, or how to make sure we do not mess up vCPU contexts, would be really great. Despite the RFC status, I did my best to facilitate that, both when writing the code and the comments/changelogs for each patch. For instance, I've put some 'XXX'-marked spots where I thought something was missing and/or commenting is most needed. If you find 5 minutes to look into them, that would be much appreciated. :-)

I know we're in a very particular moment, due to the 4.3 freeze, so I understand if people are busy finalizing the existing and proposed features instead of reviewing RFCs for new ones, but I still felt it would be worthwhile to send this out. Let's see if anyone takes this chance to tell me how bad it looks! ;-P

About the series, the patches on which to concentrate, especially at this stage, are:

 6/8 libxc: introduce xc_domain_move_memory
 7/8 libxl: introduce libxl_domain_move_memory

The others introduce minor changes, ancillary to the two above.

Thanks in advance and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 1 of 8 [RFC]] xl: allow for node-wise specification of vcpu pinning
Making it possible to use something like the following:
 * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
 * "nodes:0-3,node:^2": all pCPUs of nodes 0,1,3;
 * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6;
 * ...
in both the domain config file and `xl vcpu-pin'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -125,6 +125,26 @@ run on cpu #3 of the host.
 
 =back
 
+A C<CPU-LIST> may also be specified NUMA node-wise as follows:
+
+=over 4
+
+=item "nodes:all"
+
+To allow all the vcpus of the guest to run on all the cpus of all the NUMA
+nodes of the host.
+
+=item "nodes:0-3,node:^2"
+
+To allow all the vcpus of the guest to run on the cpus belonging to
+the NUMA nodes 0,1,3 of the host.
+
+=back
+
+Combining the two is allowed. For instance, "1,node:2,^6" means all the
+vcpus of the guest will run on cpu 1 and on all the cpus of NUMA node 2,
+but not on cpu 6.
+
 If this option is not specified, libxl automatically tries to place the new
 domain on the host's NUMA nodes (provided the host has more than one NUMA
 node) by pinning it to the cpus of those nodes.
A heuristic approach is diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -504,61 +504,99 @@ static void split_string_into_string_lis free(s); } +static int range_parse_bitmap(const char *str, libxl_bitmap *map) +{ + char *nstr, *endptr; + uint32_t ida, idb; + + ida = idb = strtoul(str, &endptr, 10); + if (endptr == str) + return EINVAL; + + if (*endptr == ''-'') { + nstr = endptr + 1; + idb = strtoul(nstr, &endptr, 10); + if (endptr == nstr) + return EINVAL; + } + + libxl_bitmap_set_none(map); + while (ida <= idb) { + libxl_bitmap_set(map, ida); + ida++; + } + + return 0; +} + static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) { - libxl_bitmap exclude_cpumap; - uint32_t cpuida, cpuidb; - char *endptr, *toka, *tokb, *saveptr = NULL; - int i, rc = 0, rmcpu; - - if (!strcmp(cpu, "all")) { + libxl_bitmap map, cpu_nodemap, *this_map; + char *ptr, *saveptr = NULL; + bool isnot, isnode; + int i, rc = 0; + + if (!strcmp(cpu, "all") || !strcmp(cpu, "nodes:all")) { libxl_bitmap_set_any(cpumap); return 0; } - if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) { - fprintf(stderr, "Error: Failed to allocate cpumap.\n"); - return ENOMEM; - } - - for (toka = strtok_r(cpu, ",", &saveptr); toka; - toka = strtok_r(NULL, ",", &saveptr)) { - rmcpu = 0; - if (*toka == ''^'') { - /* This (These) Cpu(s) will be removed from the map */ - toka++; - rmcpu = 1; - } - /* Extract a valid (range of) cpu(s) */ - cpuida = cpuidb = strtoul(toka, &endptr, 10); - if (endptr == toka) { + libxl_bitmap_init(&map); + libxl_bitmap_init(&cpu_nodemap); + + rc = libxl_node_bitmap_alloc(ctx, &cpu_nodemap, 0); + if (rc) { + fprintf(stderr, "libxl_node_bitmap_alloc failed.\n"); + goto out; + } + rc = libxl_cpu_bitmap_alloc(ctx, &map, 0); + if (rc) { + fprintf(stderr, "libxl_cpu_bitmap_alloc failed.\n"); + goto out; + } + + for (ptr = strtok_r(cpu, ",", &saveptr); ptr; + ptr = strtok_r(NULL, ",", &saveptr)) { + isnot = 
isnode = false; + + /* Are we dealing with cpus or nodes? */ + if (!strncmp(ptr, "node:", 5) || !strncmp(ptr, "nodes:", 6)) { + isnode = true; + ptr += 5 + (ptr[4] == ''s''); + } + /* Are we adding or removing cpus/nodes? */ + if (*ptr == ''^'') { + isnot = true; + ptr++; + } + /* Get in map a bitmap representative of the range */ + if (range_parse_bitmap(ptr, &map)) { fprintf(stderr, "Error: Invalid argument.\n"); rc = EINVAL; - goto vcpp_out; - } - if (*endptr == ''-'') { - tokb = endptr + 1; - cpuidb = strtoul(tokb, &endptr, 10); - if (endptr == tokb || cpuida > cpuidb) { - fprintf(stderr, "Error: Invalid argument.\n"); - rc = EINVAL; - goto vcpp_out; + goto out; + } + + /* Add or remove the specified cpus */ + if (isnode) { + rc = libxl_nodemap_to_cpumap(ctx, &map, &cpu_nodemap); + if (rc) { + fprintf(stderr, "libxl_nodemap_to_cpumap failed.\n"); + goto out; } - } - while (cpuida <= cpuidb) { - rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) : - libxl_bitmap_set(&exclude_cpumap, cpuida); - cpuida++; - } - } - - /* Clear all the cpus from the removal list */ - libxl_for_each_set_bit(i, exclude_cpumap) { - libxl_bitmap_reset(cpumap, i); - } - -vcpp_out: - libxl_bitmap_dispose(&exclude_cpumap); + this_map = &cpu_nodemap; + } else { + this_map = ↦ + } + + libxl_for_each_set_bit(i, *this_map) { + isnot ? libxl_bitmap_reset(cpumap, i) + : libxl_bitmap_set(cpumap, i); + } + } + + out: + libxl_bitmap_dispose(&map); + libxl_bitmap_dispose(&cpu_nodemap); return rc; }
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 2 of 8 [RFC]] xl: allow for changing NUMA node affinity on-line
by implementing the "node-affinity" command, acting pretty much like "vcpu-pin", although it of course affects node affinity rather than vcpu affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -626,6 +626,29 @@ different run state is appropriate. Pin
 this, by ensuring certain VCPUs can only run on certain physical
 CPUs.
 
+=item B<node-affinity> I<domain-id> I<nodes>
+
+Sets or changes the NUMA node affinity for the domain. All future
+memory allocations for the domain will use memory belonging to I<nodes>.
+Also (if the credit scheduler is in use), the VCPUs of the domain will
+run on the CPUs belonging to I<nodes> as much as possible.
+
+This is different from VCPU pinning, as VCPUs are not prevented from
+running on CPUs outside I<nodes>; that can happen, for instance, in
+order to avoid having VCPUs waiting in some PCPU's runqueue while
+other PCPUs are idle.
+
+Changing a domain's node affinity does not affect any memory that
+was already allocated before the command is invoked.
+
+The keyword B<all> can be used to make the domain affine to all the
+NUMA nodes in the host. The keyword B<none> can be used to reset the
+node affinity. In that case, and from that point on, the node affinity
+of the domain will be calculated automatically, based on its vcpu
+affinity (see B<vcpu-pin> above). More specifically, the node affinity
+will consist of the nodes to which the physical CPUs in the domain's
+vcpu affinity belong.
+
 =item B<vm-list>
 
 Prints information about guests.
This list excludes information about diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h --- a/tools/libxl/xl.h +++ b/tools/libxl/xl.h @@ -58,6 +58,7 @@ int main_vm_list(int argc, char **argv); int main_create(int argc, char **argv); int main_config_update(int argc, char **argv); int main_button_press(int argc, char **argv); +int main_nodeaffinity(int argc, char **argv); int main_vcpupin(int argc, char **argv); int main_vcpuset(int argc, char **argv); int main_memmax(int argc, char **argv); diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -601,6 +601,54 @@ static int vcpupin_parse(char *cpu, libx return rc; } +static int nodeaffinity_parse(char *node, libxl_bitmap *nodemap) +{ + char *ptr, *saveptr = NULL; + int i, rc = 0, isnot; + libxl_bitmap map; + + if (!strcmp(node, "all")) { + libxl_bitmap_set_any(nodemap); + return 0; + } else if (!strcmp(node, "none")) { + libxl_bitmap_set_none(nodemap); + return 0; + } + + rc = libxl_node_bitmap_alloc(ctx, &map, 0); + if (rc) { + fprintf(stderr, "Error: Failed to allocate nodemap.\n"); + goto out; + } + + for (ptr = strtok_r(node, ",", &saveptr); ptr; + ptr = strtok_r(NULL, ",", &saveptr)) { + isnot = false; + + /* Adding or removing nodes? */ + if (*ptr == ''^'') { + isnot = true; + ptr++; + } + /* Get in map a bitmap representative of the range */ + if (range_parse_bitmap(ptr, &map)) { + fprintf(stderr, "Error: Invalid argument.\n"); + rc = EINVAL; + goto out; + } + + libxl_for_each_set_bit(i, map) { + isnot ? 
libxl_bitmap_reset(nodemap, i) + : libxl_bitmap_set(nodemap, i); + } + } + + out: + libxl_bitmap_dispose(&map); + + return rc; +} + static void parse_config_data(const char *config_source, const char *config_data, int config_len, @@ -4583,6 +4631,39 @@ int main_vcpuset(int argc, char **argv) return 0; } +static void nodeaffinity(uint32_t domid, char *node) +{ + libxl_bitmap nodemap; + + if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) { + fprintf(stderr, "libxl_node_bitmap_alloc failed.\n"); + goto out; + } + + if (nodeaffinity_parse(node, &nodemap)) { + fprintf(stderr, "Could not parse node affinity.\n"); + goto out; + } + + if (libxl_domain_set_nodeaffinity(ctx, domid, &nodemap) == -1) + fprintf(stderr, "Could not set node affinity for dom `%d''.\n", domid); + + out: + libxl_bitmap_dispose(&nodemap); +} + +int main_nodeaffinity(int argc, char **argv) +{ + int opt; + + SWITCH_FOREACH_OPT(opt, "", NULL, "node-affinity", 2) { + /* No options */ + } + + nodeaffinity(find_domain(argv[optind]), argv[optind+1]); + return 0; +} + static void output_xeninfo(void) { const libxl_version_info *info; diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c --- a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -214,6 +214,11 @@ struct cmd_spec cmd_table[] = { "Set which CPUs a VCPU can use", "<Domain> <VCPU|all> <CPUs|all>", }, + { "node-affinity", + &main_nodeaffinity, 0, 1, + "Set the NUMA node affinity for the domain", + "<Domain> [<NODEs|all|none>]", + }, { "vcpu-set", &main_vcpuset, 0, 1, "Set the number of active VCPUs allowed for the domain",
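Assuming the command ends up behaving as documented above, usage would look something like this (the domain name is illustrative):

```
# Restrict future allocations (and, with credit, preferred CPUs) to nodes 0-1
xl node-affinity mydomain 0-1

# Make the domain affine to all nodes again
xl node-affinity mydomain all

# Go back to deriving node affinity automatically from vcpu affinity
xl node-affinity mydomain none
```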
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 3 of 8 [RFC]] libxc: introduce xc_domain_get_address_size
As a wrapper to XEN_DOMCTL_get_address_size, and use it wherever the call was being issued directly via do_domctl(), saving quite some line of code. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxc/xc_core.c b/tools/libxc/xc_core.c --- a/tools/libxc/xc_core.c +++ b/tools/libxc/xc_core.c @@ -417,24 +417,6 @@ elfnote_dump_format_version(xc_interface return dump_rtn(xch, args, (char*)&format_version, sizeof(format_version)); } -static int -get_guest_width(xc_interface *xch, - uint32_t domid, - unsigned int *guest_width) -{ - DECLARE_DOMCTL; - - memset(&domctl, 0, sizeof(domctl)); - domctl.domain = domid; - domctl.cmd = XEN_DOMCTL_get_address_size; - - if ( do_domctl(xch, &domctl) != 0 ) - return 1; - - *guest_width = domctl.u.address_size.size / 8; - return 0; -} - int xc_domain_dumpcore_via_callback(xc_interface *xch, uint32_t domid, @@ -478,11 +460,12 @@ xc_domain_dumpcore_via_callback(xc_inter struct xc_core_section_headers *sheaders = NULL; Elf64_Shdr *shdr; - if ( get_guest_width(xch, domid, &dinfo->guest_width) != 0 ) + if ( xc_domain_get_address_size(xch, domid, &dinfo->guest_width) != 0 ) { PERROR("Could not get address size for domain"); return sts; } + dinfo->guest_width /= 8; xc_core_arch_context_init(&arch_ctxt); if ( (dump_mem_start = malloc(DUMP_INCREMENT*PAGE_SIZE)) == NULL ) diff --git a/tools/libxc/xc_cpuid_x86.c b/tools/libxc/xc_cpuid_x86.c --- a/tools/libxc/xc_cpuid_x86.c +++ b/tools/libxc/xc_cpuid_x86.c @@ -436,17 +436,15 @@ static void xc_cpuid_pv_policy( const unsigned int *input, unsigned int *regs) { DECLARE_DOMCTL; + unsigned int guest_width; int guest_64bit, xen_64bit = hypervisor_is_64bit(xch); char brand[13]; uint64_t xfeature_mask; xc_cpuid_brand_get(brand); - memset(&domctl, 0, sizeof(domctl)); - domctl.domain = domid; - domctl.cmd = XEN_DOMCTL_get_address_size; - do_domctl(xch, &domctl); - guest_64bit = (domctl.u.address_size.size == 64); + xc_domain_get_address_size(xch, domid, &guest_width); + 
guest_64bit = (guest_width == 64); /* Detecting Xen''s atitude towards XSAVE */ memset(&domctl, 0, sizeof(domctl)); diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -270,6 +270,21 @@ out: return ret; } +int xc_domain_get_address_size(xc_interface *xch, uint32_t domid, + unsigned int *addr_size) +{ + DECLARE_DOMCTL; + + memset(&domctl, 0, sizeof(domctl)); + domctl.domain = domid; + domctl.cmd = XEN_DOMCTL_get_address_size; + + if ( do_domctl(xch, &domctl) != 0 ) + return 1; + + *addr_size = domctl.u.address_size.size; + return 0; +} int xc_domain_getinfo(xc_interface *xch, uint32_t first_domid, diff --git a/tools/libxc/xc_offline_page.c b/tools/libxc/xc_offline_page.c --- a/tools/libxc/xc_offline_page.c +++ b/tools/libxc/xc_offline_page.c @@ -193,20 +193,15 @@ static int get_pt_level(xc_interface *xc unsigned int *pt_level, unsigned int *gwidth) { - DECLARE_DOMCTL; xen_capabilities_info_t xen_caps = ""; if (xc_version(xch, XENVER_capabilities, &xen_caps) != 0) return -1; - memset(&domctl, 0, sizeof(domctl)); - domctl.domain = domid; - domctl.cmd = XEN_DOMCTL_get_address_size; - - if ( do_domctl(xch, &domctl) != 0 ) + if (xc_domain_get_address_size(xch, domid, gwidth) != 0) return -1; - *gwidth = domctl.u.address_size.size / 8; + *gwidth /= 8; if (strstr(xen_caps, "xen-3.0-x86_64")) /* Depends on whether it''s a compat 32-on-64 guest */ diff --git a/tools/libxc/xc_pagetab.c b/tools/libxc/xc_pagetab.c --- a/tools/libxc/xc_pagetab.c +++ b/tools/libxc/xc_pagetab.c @@ -51,15 +51,13 @@ unsigned long xc_translate_foreign_addre pt_levels = (ctx.msr_efer&EFER_LMA) ? 4 : (ctx.cr4&CR4_PAE) ? 3 : 2; paddr = ctx.cr3 & ((pt_levels == 3) ? 
~0x1full : ~0xfffull); } else { - DECLARE_DOMCTL; + unsigned int gwidth; vcpu_guest_context_any_t ctx; if (xc_vcpu_getcontext(xch, dom, vcpu, &ctx) != 0) return 0; - domctl.domain = dom; - domctl.cmd = XEN_DOMCTL_get_address_size; - if ( do_domctl(xch, &domctl) != 0 ) + if (xc_domain_get_address_size(xch, dom, &gwidth) != 0) return 0; - if (domctl.u.address_size.size == 64) { + if (gwidth == 64) { pt_levels = 4; paddr = (uint64_t)xen_cr3_to_pfn_x86_64(ctx.x64.ctrlreg[3]) << PAGE_SHIFT; diff --git a/tools/libxc/xc_resume.c b/tools/libxc/xc_resume.c --- a/tools/libxc/xc_resume.c +++ b/tools/libxc/xc_resume.c @@ -24,19 +24,6 @@ #include <xen/foreign/x86_64.h> #include <xen/hvm/params.h> -static int pv_guest_width(xc_interface *xch, uint32_t domid) -{ - DECLARE_DOMCTL; - domctl.domain = domid; - domctl.cmd = XEN_DOMCTL_get_address_size; - if ( xc_domctl(xch, &domctl) != 0 ) - { - PERROR("Could not get guest address size"); - return -1; - } - return domctl.u.address_size.size / 8; -} - static int modify_returncode(xc_interface *xch, uint32_t domid) { vcpu_guest_context_any_t ctxt; @@ -71,9 +58,9 @@ static int modify_returncode(xc_interfac else { /* Probe PV guest address width. 
*/ - dinfo->guest_width = pv_guest_width(xch, domid); - if ( dinfo->guest_width < 0 ) + if ( xc_domain_get_address_size(xch, domid, &dinfo->guest_width) ) return -1; + dinfo->guest_width /= 8; } if ( (rc = xc_vcpu_getcontext(xch, domid, 0, &ctxt)) != 0 ) @@ -120,7 +107,8 @@ static int xc_domain_resume_any(xc_inter xc_dominfo_t info; int i, rc = -1; #if defined(__i386__) || defined(__x86_64__) - struct domain_info_context _dinfo = { .p2m_size = 0 }; + struct domain_info_context _dinfo = { .guest_width = 0, + .p2m_size = 0 }; struct domain_info_context *dinfo = &_dinfo; unsigned long mfn; vcpu_guest_context_any_t ctxt; @@ -147,7 +135,8 @@ static int xc_domain_resume_any(xc_inter return rc; } - dinfo->guest_width = pv_guest_width(xch, domid); + xc_domain_get_address_size(xch, domid, &dinfo->guest_width); + dinfo->guest_width /= 8; if ( dinfo->guest_width != sizeof(long) ) { ERROR("Cannot resume uncooperative cross-address-size guests"); diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -556,6 +556,18 @@ int xc_vcpu_getaffinity(xc_interface *xc int vcpu, xc_cpumap_t cpumap); + +/** + * This function will return the address size for the specified domain. + * + * @param xch a handle to an open hypervisor interface. + * @param domid the domain id one wants the address size width of. + * @param addr_size the address size. + */ +int xc_domain_get_address_size(xc_interface *xch, uint32_t domid, + unsigned int *addr_size); + + /** * This function will return information about one or more domains. It is * designed to iterate over the list of domains. 
If a single domain is diff --git a/tools/libxc/xg_save_restore.h b/tools/libxc/xg_save_restore.h --- a/tools/libxc/xg_save_restore.h +++ b/tools/libxc/xg_save_restore.h @@ -301,7 +301,6 @@ static inline int get_platform_info(xc_i { xen_capabilities_info_t xen_caps = ""; xen_platform_parameters_t xen_params; - DECLARE_DOMCTL; if (xc_version(xch, XENVER_platform_parameters, &xen_params) != 0) return 0; @@ -313,14 +312,10 @@ static inline int get_platform_info(xc_i *hvirt_start = xen_params.virt_start; - memset(&domctl, 0, sizeof(domctl)); - domctl.domain = dom; - domctl.cmd = XEN_DOMCTL_get_address_size; - - if ( do_domctl(xch, &domctl) != 0 ) + if ( xc_domain_get_address_size(xch, dom, guest_width) != 0) return 0; - *guest_width = domctl.u.address_size.size / 8; + *guest_width /= 8; /* 64-bit tools will see the 64-bit hvirt_start, but 32-bit guests * will be using the compat one. */ diff --git a/tools/xentrace/xenctx.c b/tools/xentrace/xenctx.c --- a/tools/xentrace/xenctx.c +++ b/tools/xentrace/xenctx.c @@ -892,12 +892,9 @@ static void dump_ctx(int vcpu) } ctxt_word_size = (strstr(xen_caps, "xen-3.0-x86_64")) ? 8 : 4; } else { - struct xen_domctl domctl; - memset(&domctl, 0, sizeof domctl); - domctl.domain = xenctx.domid; - domctl.cmd = XEN_DOMCTL_get_address_size; - if (xc_domctl(xenctx.xc_handle, &domctl) == 0) - ctxt_word_size = guest_word_size = domctl.u.address_size.size / 8; + unsigned int gw; + if ( !xc_domain_get_address_size(xenctx.xc_handle, xenctx.domid, &gw) ) + ctxt_word_size = guest_word_size = gw / 8; } } #endif
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 4 of 8 [RFC]] libxc: introduce xc_map_domain_meminfo (and xc_unmap_domain_meminfo)
And use it in xc_exchange_page(). This is basically because the following changes need something really similar to the set of steps that are here abstracted in these two functions. This is basically pure code motion and, despite of the change in the interface and in the signature of many functions, no functional change is involved. XXX: There is probably more room for using this in other places too, most likely, the save/restore code. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -21,6 +21,8 @@ */ #include "xc_private.h" +#include "xc_core.h" +#include "xg_private.h" #include "xg_save_restore.h" #include <xen/memory.h> #include <xen/hvm/hvm_op.h> @@ -1460,6 +1462,132 @@ int xc_domain_bind_pt_isa_irq( PT_IRQ_TYPE_ISA, 0, 0, 0, machine_irq)); } +int xc_unmap_domain_meminfo(xc_interface *xch, struct xc_domain_meminfo *minfo) +{ + struct domain_info_context _di = { .guest_width = minfo->guest_width }; + struct domain_info_context *dinfo = &_di; + + free(minfo->pfn_type); + if ( minfo->p2m_table ) + munmap(minfo->p2m_table, P2M_FLL_ENTRIES * PAGE_SIZE); + minfo->p2m_table = NULL; + + return 0; +} + +int xc_map_domain_meminfo(xc_interface *xch, int domid, + struct xc_domain_meminfo *minfo) +{ + struct domain_info_context _di; + struct domain_info_context *dinfo = &_di; + + xc_dominfo_t info; + shared_info_any_t *live_shinfo; + xen_capabilities_info_t xen_caps = ""; + int i; + + /* Only be initialized once */ + if ( minfo->pfn_type || minfo->p2m_table ) + { + errno = EINVAL; + return -1; + } + + if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ) + { + PERROR("Could not get domain info"); + return -1; + } + + if ( xc_domain_get_address_size(xch, domid, &minfo->guest_width) ) + { + PERROR("Could not get domain address size"); + return -1; + } + minfo->guest_width /= 8; + _di.guest_width = minfo->guest_width; + + /* Get page table levels 
(see get_platform_info() in xg_save_restore.h */ + if ( xc_version(xch, XENVER_capabilities, &xen_caps) ) + { + PERROR("Could not get Xen capabilities (for page table levels)"); + return -1; + } + if ( strstr(xen_caps, "xen-3.0-x86_64") ) + /* Depends on whether it''s a compat 32-on-64 guest */ + minfo->pt_levels = ( (minfo->guest_width == 8) ? 4 : 3 ); + else if ( strstr(xen_caps, "xen-3.0-x86_32p") ) + minfo->pt_levels = 3; + else if ( strstr(xen_caps, "xen-3.0-x86_32") ) + minfo->pt_levels = 2; + else + { + errno = EFAULT; + return -1; + } + + /* We need the shared info page for mapping the P2M */ + live_shinfo = xc_map_foreign_range(xch, domid, PAGE_SIZE, PROT_READ, + info.shared_info_frame); + if ( !live_shinfo ) + { + PERROR("Could not map the shared info frame (MFN 0x%lx)", + info.shared_info_frame); + return -1; + } + + if ( xc_core_arch_map_p2m_writable(xch, minfo->guest_width, &info, + live_shinfo, &minfo->p2m_table, + &minfo->p2m_size) ) + { + PERROR("Could not map the P2M table"); + munmap(live_shinfo, PAGE_SIZE); + return -1; + } + munmap(live_shinfo, PAGE_SIZE); + _di.p2m_size = minfo->p2m_size; + + /* Make space and prepare for getting the PFN types */ + minfo->pfn_type = calloc(sizeof(*minfo->pfn_type), minfo->p2m_size); + if ( !minfo->pfn_type ) + { + PERROR("Could not allocate memory for the PFN types"); + goto failed; + } + for ( i = 0; i < minfo->p2m_size; i++ ) + minfo->pfn_type[i] = pfn_to_mfn(i, minfo->p2m_table, + minfo->guest_width); + + /* Retrieve PFN types in batches */ + for ( i = 0; i < minfo->p2m_size ; i+=1024 ) + { + int count = ((minfo->p2m_size - i ) > 1024 ) ? 
+ 1024: (minfo->p2m_size - i); + + if ( xc_get_pfn_type_batch(xch, domid, count, minfo->pfn_type + i) ) + { + PERROR("Could not get %d-eth batch of PFN types", (i+1)/1024); + goto failed; + } + } + + return 0; + +failed: + if ( minfo->pfn_type ) + { + free(minfo->pfn_type); + minfo->pfn_type = NULL; + } + if ( minfo->p2m_table ) + { + munmap(minfo->p2m_table, P2M_FLL_ENTRIES * PAGE_SIZE); + minfo->p2m_table = NULL; + } + + return -1; +} + int xc_domain_memory_mapping( xc_interface *xch, uint32_t domid, diff --git a/tools/libxc/xc_offline_page.c b/tools/libxc/xc_offline_page.c --- a/tools/libxc/xc_offline_page.c +++ b/tools/libxc/xc_offline_page.c @@ -33,17 +33,6 @@ #include "xg_private.h" #include "xg_save_restore.h" -struct domain_mem_info{ - int domid; - unsigned int pt_level; - unsigned int guest_width; - xen_pfn_t *pfn_type; - xen_pfn_t *p2m_table; - unsigned long p2m_size; - xen_pfn_t *m2p_table; - int max_mfn; -}; - struct pte_backup_entry { xen_pfn_t table_mfn; @@ -180,141 +169,6 @@ static int xc_is_page_granted_v2(xc_inte return (i != gnt_num); } -static xen_pfn_t pfn_to_mfn(xen_pfn_t pfn, xen_pfn_t *p2m, int gwidth) -{ - return ((xen_pfn_t) ((gwidth==8)? - (((uint64_t *)p2m)[(pfn)]): - ((((uint32_t *)p2m)[(pfn)]) == 0xffffffffU ? - (-1UL) : - (((uint32_t *)p2m)[(pfn)])))); -} - -static int get_pt_level(xc_interface *xch, uint32_t domid, - unsigned int *pt_level, - unsigned int *gwidth) -{ - xen_capabilities_info_t xen_caps = ""; - - if (xc_version(xch, XENVER_capabilities, &xen_caps) != 0) - return -1; - - if (xc_domain_get_address_size(xch, domid, gwidth) != 0) - return -1; - - *gwidth /= 8; - - if (strstr(xen_caps, "xen-3.0-x86_64")) - /* Depends on whether it''s a compat 32-on-64 guest */ - *pt_level = ( (*gwidth == 8) ? 
4 : 3 ); - else if (strstr(xen_caps, "xen-3.0-x86_32p")) - *pt_level = 3; - else if (strstr(xen_caps, "xen-3.0-x86_32")) - *pt_level = 2; - else - return -1; - - return 0; -} - -static int close_mem_info(xc_interface *xch, struct domain_mem_info *minfo) -{ - if (minfo->pfn_type) - free(minfo->pfn_type); - munmap(minfo->m2p_table, M2P_SIZE(minfo->max_mfn)); - munmap(minfo->p2m_table, P2M_FLL_ENTRIES * PAGE_SIZE); - minfo->p2m_table = minfo->m2p_table = NULL; - - return 0; -} - -static int init_mem_info(xc_interface *xch, int domid, - struct domain_mem_info *minfo, - xc_dominfo_t *info) -{ - uint64_aligned_t shared_info_frame; - shared_info_any_t *live_shinfo = NULL; - int i, rc; - - /* Only be initialized once */ - if (minfo->pfn_type || minfo->m2p_table || minfo->p2m_table) - return -EINVAL; - - if ( get_pt_level(xch, domid, &minfo->pt_level, - &minfo->guest_width) ) - { - ERROR("Unable to get PT level info."); - return -EFAULT; - } - dinfo->guest_width = minfo->guest_width; - - shared_info_frame = info->shared_info_frame; - - live_shinfo = xc_map_foreign_range(xch, domid, - PAGE_SIZE, PROT_READ, shared_info_frame); - if ( !live_shinfo ) - { - ERROR("Couldn''t map live_shinfo"); - return -EFAULT; - } - - if ( (rc = xc_core_arch_map_p2m_writable(xch, minfo->guest_width, - info, live_shinfo, &minfo->p2m_table, &minfo->p2m_size)) ) - { - ERROR("Couldn''t map p2m table %x\n", rc); - goto failed; - } - munmap(live_shinfo, PAGE_SIZE); - live_shinfo = NULL; - - dinfo->p2m_size = minfo->p2m_size; - - minfo->max_mfn = xc_maximum_ram_page(xch); - if ( !(minfo->m2p_table - xc_map_m2p(xch, minfo->max_mfn, PROT_READ, NULL)) ) - { - ERROR("Failed to map live M2P table"); - goto failed; - } - - /* Get pfn type */ - minfo->pfn_type = calloc(sizeof(*minfo->pfn_type), minfo->p2m_size); - if (!minfo->pfn_type) - { - ERROR("Failed to malloc pfn_type\n"); - goto failed; - } - - for (i = 0; i < minfo->p2m_size; i++) - minfo->pfn_type[i] = pfn_to_mfn(i, minfo->p2m_table, - 
minfo->guest_width); - - for (i = 0; i < minfo->p2m_size ; i+=1024) - { - int count = ((dinfo->p2m_size - i ) > 1024 ) ? 1024: (dinfo->p2m_size - i); - if ( ( rc = xc_get_pfn_type_batch(xch, domid, count, - minfo->pfn_type + i)) ) - { - ERROR("Failed to get pfn_type %x\n", rc); - goto failed; - } - } - return 0; - -failed: - if (minfo->pfn_type) - { - free(minfo->pfn_type); - minfo->pfn_type = NULL; - } - if (live_shinfo) - munmap(live_shinfo, PAGE_SIZE); - munmap(minfo->m2p_table, M2P_SIZE(minfo->max_mfn)); - munmap(minfo->p2m_table, P2M_FLL_ENTRIES * PAGE_SIZE); - minfo->p2m_table = minfo->m2p_table = NULL; - - return -1; -} - static int backup_ptes(xen_pfn_t table_mfn, int offset, struct pte_backup *backup) { @@ -404,7 +258,7 @@ static int __update_pte(xc_interface *xc } static int change_pte(xc_interface *xch, int domid, - struct domain_mem_info *minfo, + struct xc_domain_meminfo *minfo, struct pte_backup *backup, struct xc_mmu *mmu, pte_func func, @@ -414,7 +268,7 @@ static int change_pte(xc_interface *xch, uint64_t i; void *content = NULL; - pte_num = PAGE_SIZE / ((minfo->pt_level == 2) ? 4 : 8); + pte_num = PAGE_SIZE / ((minfo->pt_levels == 2) ? 4 : 8); for (i = 0; i < minfo->p2m_size; i++) { @@ -437,7 +291,7 @@ static int change_pte(xc_interface *xch, for (j = 0; j < pte_num; j++) { - if ( minfo->pt_level == 2 ) + if ( minfo->pt_levels == 2 ) pte = ((const uint32_t*)content)[j]; else pte = ((const uint64_t*)content)[j]; @@ -449,7 +303,7 @@ static int change_pte(xc_interface *xch, case 1: if ( xc_add_mmu_update(xch, mmu, table_mfn << PAGE_SHIFT | - j * ( (minfo->pt_level == 2) ? + j * ( (minfo->pt_levels == 2) ? 
sizeof(uint32_t): sizeof(uint64_t)) | MMU_PT_UPDATE_PRESERVE_AD, new_pte) ) @@ -482,7 +336,7 @@ failed: } static int update_pte(xc_interface *xch, int domid, - struct domain_mem_info *minfo, + struct xc_domain_meminfo *minfo, struct pte_backup *backup, struct xc_mmu *mmu, unsigned long new_mfn) @@ -492,7 +346,7 @@ static int update_pte(xc_interface *xch, } static int clear_pte(xc_interface *xch, int domid, - struct domain_mem_info *minfo, + struct xc_domain_meminfo *minfo, struct pte_backup *backup, struct xc_mmu *mmu, xen_pfn_t mfn) @@ -540,7 +394,7 @@ static int is_page_exchangable(xc_interf int xc_exchange_page(xc_interface *xch, int domid, xen_pfn_t mfn) { xc_dominfo_t info; - struct domain_mem_info minfo; + struct xc_domain_meminfo minfo; struct xc_mmu *mmu = NULL; struct pte_backup old_ptes = {NULL, 0, 0}; grant_entry_v1_t *gnttab_v1 = NULL; @@ -551,6 +405,8 @@ int xc_exchange_page(xc_interface *xch, int rc, result = -1; uint32_t status; xen_pfn_t new_mfn, gpfn; + xen_pfn_t *m2p_table; + int max_mfn; if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ) { @@ -570,10 +426,26 @@ int xc_exchange_page(xc_interface *xch, return -EINVAL; } - /* Get domain''s memory information */ + /* Map M2P and obtain gpfn */ + max_mfn = xc_maximum_ram_page(xch); + if ( !(m2p_table = xc_map_m2p(xch, max_mfn, PROT_READ, NULL)) ) + { + PERROR("Failed to map live M2P table"); + return -EFAULT; + } + gpfn = m2p_table[mfn]; + + /* Map domain''s memory information */ memset(&minfo, 0, sizeof(minfo)); - init_mem_info(xch, domid, &minfo, &info); - gpfn = minfo.m2p_table[mfn]; + if ( xc_map_domain_meminfo(xch, domid, &minfo) ) + { + PERROR("Could not map domain''s memory information\n"); + return -EFAULT; + } + + /* For translation macros */ + dinfo->guest_width = minfo.guest_width; + dinfo->p2m_size = minfo.p2m_size; /* Don''t exchange CR3 for PAE guest in PAE host environment */ if (minfo.guest_width > sizeof(long)) @@ -763,7 +635,8 @@ failed: if (gnttab_v2) munmap(gnttab_v2, gnt_num / 
(PAGE_SIZE/sizeof(grant_entry_v2_t))); - close_mem_info(xch, &minfo); + xc_unmap_domain_meminfo(xch, &minfo); + munmap(m2p_table, M2P_SIZE(max_mfn)); return result; } diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h --- a/tools/libxc/xenguest.h +++ b/tools/libxc/xenguest.h @@ -274,6 +274,23 @@ int xc_exchange_page(xc_interface *xch, /** + * Memory related information, such as PFN types, the P2M table, + * the guest word width and the guest page table levels. + */ +struct xc_domain_meminfo { + unsigned int pt_levels; + unsigned int guest_width; + xen_pfn_t *pfn_type; + xen_pfn_t *p2m_table; + unsigned long p2m_size; +}; + +int xc_map_domain_meminfo(xc_interface *xch, int domid, + struct xc_domain_meminfo *minfo); + +int xc_unmap_domain_meminfo(xc_interface *xch, struct xc_domain_meminfo *mem); + +/** * This function map m2p table * @parm xch a handle to an open hypervisor interface * @parm max_mfn the max pfn diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h --- a/tools/libxc/xg_private.h +++ b/tools/libxc/xg_private.h @@ -136,6 +136,15 @@ struct domain_info_context { unsigned long p2m_size; }; +static inline xen_pfn_t pfn_to_mfn(xen_pfn_t pfn, xen_pfn_t *p2m, int gwidth) +{ + return ((xen_pfn_t) ((gwidth==8)? + (((uint64_t *)p2m)[(pfn)]): + ((((uint32_t *)p2m)[(pfn)]) == 0xffffffffU ? + (-1UL) : + (((uint32_t *)p2m)[(pfn)])))); +} + /* Number of xen_pfn_t in a page */ #define FPP (PAGE_SIZE/(dinfo->guest_width))
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 5 of 8 [RFC]] libxc: allow for ctxt to be NULL in xc_vcpu_setcontext
since, as can be seen in xen/common/domctl.c, that is not a problem. In fact, passing a NULL context is how vcpu_reset() is exposed outside the hypervisor.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1120,12 +1120,6 @@ int xc_vcpu_setcontext(xc_interface *xch
     DECLARE_HYPERCALL_BOUNCE(ctxt, sizeof(vcpu_guest_context_any_t), XC_HYPERCALL_BUFFER_BOUNCE_IN);
     int rc;
 
-    if (ctxt == NULL)
-    {
-        errno = EINVAL;
-        return -1;
-    }
-
     if ( xc_hypercall_bounce_pre(xch, ctxt) )
         return -1;
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
as a mechanism for deallocating and reallocating (immediately!) _all_ the memory of a domain. Notice it relies on the guest being suspended already, before the function is invoked.

Of course, it is quite likely that the memory ends up in different places from where it was before the call but, for instance, whether that place is actually a different NUMA node (or anything else) does not depend in any way on this function. In fact, here the guest pages are just freed and immediately re-allocated (you can see it as a very quick, back-to-back save-restore cycle). If the current domain configuration says, for instance, that new allocations should go to a specific NUMA node, then the whole domain is, as a matter of fact, moved there, but again, this is not something this function does explicitly.

The way we do this is, very briefly, as follows:
 1. drop all the references to all the pages of a domain,
 2. backup the content of a batch of pages,
 3. deallocate the batch,
 4. allocate a new set of pages for the batch,
 5. copy the backed up content into the new pages,
 6. if there are more pages, go back to 2, otherwise
 7. update the page tables, the vcpu contexts, the P2M, etc.

The above raises a number of quite complex issues and _not_ all of them are dealt with or solved in this series (RFC means something after all, doesn't it? ;-P).

XXX Open issues are:
 - HVM ("easy" to add, but it's not in this patch. See the cover letter for the series);
 - PAE guests, as they need special attention for some of the page tables (should be trivial to add);
 - grant tables/granted pages: how to move them?
 - TMEM: how to "move" it?
 - shared/paged pages: what to do with them?
 - guest pages mapped in Xen, for instance:
   * vcpu info pages: moved but, how to update the mapping?
   * EOI page: moved but, how to update the mapping?
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile --- a/tools/libxc/Makefile +++ b/tools/libxc/Makefile @@ -48,6 +48,11 @@ else GUEST_SRCS-y += xc_nomigrate.c endif +# XXX: Well, for sure there are some X86-ism in the current code. +# Making it more ARM friendly should not be a big deal though, +# will do for next release. +GUEST_SRCS-$(CONFIG_X86) += xc_domain_movemem.c + vpath %.c ../../xen/common/libelf CFLAGS += -I../../xen/common/libelf diff --git a/tools/libxc/xc_domain_movemem.c b/tools/libxc/xc_domain_movemem.c new file mode 100644 --- /dev/null +++ b/tools/libxc/xc_domain_movemem.c @@ -0,0 +1,766 @@ +/****************************************************************************** + * xc_domain_movemem.c + * + * Deallocate and reallocate all the memory of a domain. + * + * Copyright (c) 2013, Dario Faggioli. + * Copyright (c) 2012, Citrix Systems, Inc. + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; + * version 2.1 of the License. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <inttypes.h>
+#include <time.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/time.h>
+#include <xc_core.h>
+
+#include "xc_private.h"
+#include "xc_dom.h"
+#include "xg_private.h"
+#include "xg_save_restore.h"
+
+/* Needed by the translation macros in xg_private.h */
+static struct domain_info_context _dinfo;
+static struct domain_info_context *dinfo = &_dinfo;
+
+#define MAX_BATCH_SIZE 1024
+#define MAX_PIN_BATCH 1024
+
+#define MFN_IS_IN_PSEUDOPHYS_MAP(_mfn, _max_mfn, _minfo, _m2p) \
+    (((_mfn) < (_max_mfn)) && ((mfn_to_pfn(_mfn, _m2p) < (_minfo).p2m_size) && \
+     (pfn_to_mfn(mfn_to_pfn(_mfn, _m2p), (_minfo).p2m_table, \
+                 (_minfo).guest_width) == (_mfn))))
+
+/*
+ * This is to determine which entries in this page table hold reserved
+ * hypervisor mappings. This depends on the current page table type as
+ * well as the number of paging levels (see also xc_domain_save.c).
+ *
+ * XXX: export this function so that it can be used both here and from
+ *      canonicalize_pagetable(), in xc_domain_save.c.
+ */
+static int is_xen_mapping(struct xc_domain_meminfo *minfo, unsigned long type,
+                          unsigned long hvirt_start, unsigned long m2p_mfn0,
+                          const void *spage, int pte)
+{
+    int xen_start, xen_end, pte_last;
+
+    xen_start = xen_end = pte_last = PAGE_SIZE / 8;
+
+    if ( (minfo->pt_levels == 3) && (type == XEN_DOMCTL_PFINFO_L3TAB) )
+        xen_start = L3_PAGETABLE_ENTRIES_PAE;
+
+    /*
+     * In PAE only the L2 mapping the top 1GB contains Xen mappings.
+     * We can spot this by looking for the guest's mapping of the m2p.
+     * Guests must ensure that this check will fail for other L2s.
+     */
+    if ( (minfo->pt_levels == 3) && (type == XEN_DOMCTL_PFINFO_L2TAB) )
+    {
+        int hstart;
+        uint64_t he;
+
+        hstart = (hvirt_start >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff;
+        he = ((const uint64_t *) spage)[hstart];
+
+        if ( ((he >> PAGE_SHIFT) & MFN_MASK_X86) == m2p_mfn0 )
+        {
+            /* hvirt starts with xen stuff... */
+            xen_start = hstart;
+        }
+        else if ( hvirt_start != 0xf5800000 )
+        {
+            /* old L2s from before hole was shrunk... */
+            hstart = (0xf5800000 >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff;
+            he = ((const uint64_t *) spage)[hstart];
+            if ( ((he >> PAGE_SHIFT) & MFN_MASK_X86) == m2p_mfn0 )
+                xen_start = hstart;
+        }
+    }
+
+    if ( (minfo->pt_levels == 4) && (type == XEN_DOMCTL_PFINFO_L4TAB) )
+    {
+        /*
+         * XXX SMH: should compute these from hvirt_start (which we have)
+         *          and hvirt_end (which we don't)
+         */
+        xen_start = 256;
+        xen_end = 272;
+    }
+
+    return pte >= xen_start && pte < xen_end;
+}
+
+/*
+ * This function will basically deallocate _all_ the memory of a domain and
+ * reallocate it immediately. It relies on the guest being suspended
+ * already, before the function is even invoked.
+ *
+ * Of course, it is quite likely that the memory ends up in different places
+ * from where it was before calling this but, for instance, the fact that
+ * this is actually a different NUMA node (or anything else) does not
+ * depend in any way on this function. In fact, here the guest pages are
+ * just freed and immediately re-allocated (you can see it as a very quick,
+ * back-to-back domain_save--domain_restore). If the current domain
+ * configuration says, for instance, that new allocations should go to a
+ * different NUMA node, then the whole domain is moved there, but again,
+ * this is not something this function does explicitly.
+ *
+ * If actually interested in doing something like that (i.e., moving the
+ * domain to a different NUMA node), calling xc_domain_node_setaffinity()
+ * right before this should achieve it.
+ */
+int xc_domain_move_memory(xc_interface *xch, uint32_t domid/*, int hvm*/)
+{
+    unsigned int i, j;
+    int rc = 1;
+
+    xc_dominfo_t info;
+    struct xc_domain_meminfo minfo;
+
+    struct mmuext_op pin[MAX_PIN_BATCH];
+    unsigned int nr_pins;
+
+    struct xc_mmu *mmu = NULL;
+    unsigned int xen_pt_levels, dom_guest_width;
+    unsigned long max_mfn, hvirt_start, m2p_mfn0;
+    vcpu_guest_context_any_t ctxt;
+
+    void *live_p2m_frame_list_list = NULL;
+    void *live_p2m_frame_list = NULL;
+
+    /*
+     * XXX: grant tables & granted pages need to be considered, e.g.,
+     *      using xc_is_page_granted_vX() in xc_offline_page.c to
+     *      recognise them, etc.
+    int gnt_num;
+    grant_entry_v1_t *gnttab_v1 = NULL;
+    grant_entry_v2_t *gnttab_v2 = NULL;
+    */
+
+    void *old_p, *new_p, *backup = NULL;
+    unsigned long mfn, pfn;
+    uint64_t fll;
+
+    xen_pfn_t *new_mfns = NULL, *old_mfns = NULL, *batch_pfns = NULL;
+    int pte_num = PAGE_SIZE / 8, cleared_pte = 0;
+    xen_pfn_t *m2p_table, *orig_m2p = NULL;
+    shared_info_any_t *live_shinfo = NULL;
+
+    unsigned long n = 0, n_skip = 0;
+
+    int debug = 0; /* XXX will become a parameter */
+
+    if ( !get_platform_info(xch, domid, &max_mfn, &hvirt_start,
+                            &xen_pt_levels, &dom_guest_width) )
+    {
+        ERROR("Failed getting platform info");
+        return 1;
+    }
+
+    /* We expect the domain to be suspended already */
+    if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 )
+    {
+        PERROR("Failed getting domain info");
+        return 1;
+    }
+    if ( !info.shutdown || info.shutdown_reason != SHUTDOWN_suspend)
+    {
+        PERROR("Domain appears not to be suspended");
+        return 1;
+    }
+
+    DBGPRINTF("Establishing the mappings for M2P and P2M");
+    memset(&minfo, 0, sizeof(minfo));
+    if ( !(m2p_table = xc_map_m2p(xch, max_mfn, PROT_READ, &m2p_mfn0)) )
+    {
+        PERROR("Failed to map the M2P table");
+        return 1;
+    }
+    if ( xc_map_domain_meminfo(xch, domid, &minfo) )
+    {
+        PERROR("Failed to map domain's memory information");
+        goto out;
+    }
+    dinfo->guest_width = minfo.guest_width;
+    dinfo->p2m_size = minfo.p2m_size;
+
+    /*
+     * XXX
+    DBGPRINTF("Mapping the grant tables");
+    gnttab_v2 = xc_gnttab_map_table_v2(xch, domid, &gnt_num);
+    if (!gnttab_v2)
+    {
+        PERROR("Failed to map V2 grant table... Trying V1");
+        gnttab_v1 = xc_gnttab_map_table_v1(xch, domid, &gnt_num);
+        if (!gnttab_v1)
+        {
+            PERROR("Failed to map grant table");
+            goto out;
+        }
+    }
+    DBGPRINTF("Grant table mapped. %d grants found", gnt_num);
+    */
+
+    mmu = xc_alloc_mmu_updates(xch, (domid+1)<<16|domid);
+    if ( mmu == NULL )
+    {
+        PERROR("Failed to allocate memory for MMU updates");
+        goto out;
+    }
+
+    /* Alloc support data structures */
+    new_mfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t));
+    old_mfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t));
+    batch_pfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t));
+
+    backup = malloc(PAGE_SIZE * MAX_BATCH_SIZE);
+
+    orig_m2p = calloc(max_mfn, sizeof(xen_pfn_t));
+
+    if ( !new_mfns || !old_mfns || !batch_pfns || !backup || !orig_m2p )
+    {
+        ERROR("Failed to allocate copying and/or backup data structures");
+        goto out;
+    }
+
+    DBGPRINTF("Saving the original M2P");
+    memcpy(orig_m2p, m2p_table, max_mfn * sizeof(xen_pfn_t));
+
+    DBGPRINTF("Starting deallocating and reallocating all memory for domain %d"
+              "\n\tnr_pages=%lu, nr_shared_pages=%lu, nr_paged_pages=%lu"
+              "\n\tnr_online_vcpus=%u, max_vcpu_id=%u",
+              domid, info.nr_pages, info.nr_shared_pages, info.nr_paged_pages,
+              info.nr_online_vcpus, info.max_vcpu_id);
+
+    /* Beware: no going back from this point!! */
+
+    /*
+     * As a part of the process of dropping all the references to the existing
+     * pages in memory, so that we can free (and then re-allocate) them, we
+     * need to unpin them.
+     *
+     * We do that in batches of 1024 PFNs at each step, to amortize the cost
+     * of xc_mmuext_op() calls.
+     */
+    nr_pins = 0;
+    for ( i = 0; i < minfo.p2m_size; i++ )
+    {
+        if ( (minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        pin[nr_pins].cmd = MMUEXT_UNPIN_TABLE;
+        pin[nr_pins].arg1.mfn = minfo.p2m_table[i];
+        nr_pins++;
+
+        if ( nr_pins == MAX_PIN_BATCH )
+        {
+            if ( xc_mmuext_op(xch, pin, nr_pins, domid) < 0 )
+            {
+                PERROR("Failed to unpin a batch of %d MFNs", nr_pins);
+                goto out;
+            }
+            else
+                DBGPRINTF("Unpinned a batch of %d MFNs", nr_pins);
+            nr_pins = 0;
+        }
+    }
+    if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, domid) < 0) )
+    {
+        PERROR("Failed to unpin a batch of %d MFNs", nr_pins);
+        goto out;
+    }
+    else
+        DBGPRINTF("Unpinned a batch of %d MFNs", nr_pins);
+
+    /*
+     * After unpinning, we also need to remove the _PAGE_PRESENT bit from
+     * the domain's PTEs, for the pages that we want to deallocate, or they
+     * just could not go away.
+     */
+    for (i = 0; i < minfo.p2m_size; i++)
+    {
+        void *content;
+        xen_pfn_t table_type, table_mfn = pfn_to_mfn(i, minfo.p2m_table,
+                                                     minfo.guest_width);
+
+        if ( table_mfn == INVALID_P2M_ENTRY ||
+             minfo.pfn_type[i] == XEN_DOMCTL_PFINFO_XTAB )
+        {
+            DBGPRINTF("Broken P2M entry at PFN 0x%x", i);
+            continue;
+        }
+
+        table_type = minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+        if ( table_type < XEN_DOMCTL_PFINFO_L1TAB ||
+             table_type > XEN_DOMCTL_PFINFO_L4TAB )
+            continue;
+
+        content = xc_map_foreign_range(xch, domid, PAGE_SIZE,
+                                       PROT_READ, table_mfn);
+        if ( !content )
+        {
+            PERROR("Failed to map the table at MFN 0x%lx", table_mfn);
+            goto out;
+        }
+
+        /* Go through each PTE of each table and clear the _PAGE_PRESENT bit */
+        for ( j = 0; j < pte_num; j++ )
+        {
+            uint64_t pte = ((uint64_t *)content)[j];
+
+            if ( !pte || is_xen_mapping(&minfo, table_type, hvirt_start, m2p_mfn0, content, j) )
+                continue;
+
+            if ( debug )
+                DBGPRINTF("Entry %d: PTE=0x%lx, MFN=0x%lx, PFN=0x%lx", j, pte,
+                          (uint64_t)((pte & MADDR_MASK_X86)>>PAGE_SHIFT),
+                          m2p_table[(unsigned long)((pte & MADDR_MASK_X86)
+                                    >>PAGE_SHIFT)]);
+
+            pfn = m2p_table[(pte & MADDR_MASK_X86)>>PAGE_SHIFT];
+            pte &= ~_PAGE_PRESENT;
+
+            if ( xc_add_mmu_update(xch, mmu, table_mfn << PAGE_SHIFT |
+                                   (j * (sizeof(uint64_t))) |
+                                   MMU_PT_UPDATE_PRESERVE_AD, pte) )
+                PERROR("Failed to add some PTE update operation");
+            else
+                cleared_pte++;
+        }
+
+        if (content)
+            munmap(content, PAGE_SIZE);
+    }
+    if ( cleared_pte && xc_flush_mmu_updates(xch, mmu) )
+    {
+        PERROR("Failed flushing some PTE update operations");
+        goto out;
+    }
+    else
+        DBGPRINTF("Cleared presence for %d PTEs", cleared_pte);
+
+    /* Scan all the P2M ... */
+    while ( n < minfo.p2m_size )
+    {
+        /* ... But all operations are done in batches */
+        for ( i = 0; (i < MAX_BATCH_SIZE) && (n < minfo.p2m_size); n++ )
+        {
+            xen_pfn_t mfn = pfn_to_mfn(n, minfo.p2m_table, minfo.guest_width);
+            xen_pfn_t mfn_type = minfo.pfn_type[n] & XEN_DOMCTL_PFINFO_LTAB_MASK;
+
+            if (mfn == INVALID_P2M_ENTRY || !is_mapped(mfn) )
+            {
+                if ( debug )
+                    DBGPRINTF("Skipping invalid or unmapped MFN 0x%lx", mfn);
+                n_skip++;
+                continue;
+            }
+            if ( mfn_type == XEN_DOMCTL_PFINFO_BROKEN ||
+                 mfn_type == XEN_DOMCTL_PFINFO_XTAB ||
+                 mfn_type == XEN_DOMCTL_PFINFO_XALLOC )
+            {
+                if ( debug )
+                    DBGPRINTF("Skipping broken or alloc only MFN 0x%lx", mfn);
+                n_skip++;
+                continue;
+            }
+
+            /*
+            if ( gnttab_v1 ?
+                 xc_is_page_granted_v1(xch, mfn, gnttab_v1, gnt_num) :
+                 xc_is_page_granted_v2(xch, mfn, gnttab_v2, gnt_num) )
+            {
+                n_skip++;
+                continue;
+            }
+            */
+
+            old_mfns[i] = mfn;
+            batch_pfns[i] = n;
+            i++;
+        }
+
+        /* Was the batch empty? */
+        if ( i == 0)
+            continue;
+
+        /*
+         * And now the core of the whole thing: map the PFNs in the batch,
+         * back them up, allocate new pages for them, and copy them there.
+         * We do this in this order, and we pass through a local backup,
+         * because we don't want to risk hitting the max_mem limit for
+         * the domain (which would be possible, depending on MAX_BATCH_SIZE,
+         * if we try to do it like allocate->copy->deallocate).
+         *
+         * With MAX_BATCH_SIZE of 1024 and 4K pages, this means we are moving
+         * 4MB of guest memory for each batch.
+         */
+
+        /* Map and backup */
+        old_p = xc_map_foreign_pages(xch, domid, PROT_READ, old_mfns, i);
+        if ( !old_p )
+        {
+            PERROR("Failed mapping the current MFNs\n");
+            goto out;
+        }
+        memcpy(backup, old_p, PAGE_SIZE * i);
+        munmap(old_p, PAGE_SIZE * i);
+
+        /* Deallocation and re-allocation */
+        if ( xc_domain_decrease_reservation(xch, domid, i, 0, old_mfns) != i ||
+             xc_domain_populate_physmap_exact(xch, domid, i, 0, 0, new_mfns) )
+        {
+            PERROR("Failed making space or allocating the new MFNs\n");
+            goto out;
+        }
+
+        /* Map the new pages, copy the content and unmap */
+        new_p = xc_map_foreign_pages(xch, domid, PROT_WRITE, new_mfns, i);
+        if ( !new_p )
+        {
+            PERROR("Failed mapping the new MFNs\n");
+            goto out;
+        }
+        memcpy(new_p, backup, PAGE_SIZE * i);
+        munmap(new_p, PAGE_SIZE * i);
+
+        /*
+         * Since we already have the new MFNs, we can update both the M2P
+         * and the P2M right here, within this same loop.
+         */
+        for ( j = 0; j < i; j++ )
+        {
+            minfo.p2m_table[batch_pfns[j]] = new_mfns[j];
+            if ( xc_add_mmu_update(xch, mmu,
+                                   (((uint64_t)new_mfns[j]) << PAGE_SHIFT) |
+                                   MMU_MACHPHYS_UPDATE, batch_pfns[j]) )
+            {
+                PERROR("Failed updating M2P\n");
+                goto out;
+            }
+        }
+        if ( xc_flush_mmu_updates(xch, mmu) )
+        {
+            PERROR("Failed updating M2P\n");
+            goto out;
+        }
+
+        DBGPRINTF("Batch %lu/%ld done (%lu pages skipped)",
+                  n / MAX_BATCH_SIZE, minfo.p2m_size / MAX_BATCH_SIZE, n_skip);
+    }
+
+    /*
+     * Finally (oh, well...) update the PTEs of the domain again, putting
+     * the new MFNs there, and making the entries _PAGE_PRESENT again.
+     *
+     * This is a kind of uncanonicalization, like the one happening in
+     * save-restore, although a very special one, and we rely on the
+     * snapshot of the M2P we made before starting all the
+     * deallocation/reallocation process.
+     */
+    for ( i = 0; i < minfo.p2m_size; i++ )
+    {
+        void *content;
+        xen_pfn_t table_type, table_mfn = pfn_to_mfn(i, minfo.p2m_table,
+                                                     minfo.guest_width);
+
+        table_type = minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+        if ( table_type < XEN_DOMCTL_PFINFO_L1TAB ||
+             table_type > XEN_DOMCTL_PFINFO_L4TAB )
+            continue;
+
+        /* We of course only care about tables */
+        content = xc_map_foreign_range(xch, domid, PAGE_SIZE,
+                                       PROT_WRITE, table_mfn);
+        if ( !content )
+        {
+            PERROR("Failed to map the table at MFN 0x%lx", table_mfn);
+            continue;
+        }
+
+        for ( j = 0; j < PAGE_SIZE / 8; j++ )
+        {
+            uint64_t pte = ((uint64_t *)content)[j];
+
+            if ( !pte || is_xen_mapping(&minfo, table_type, hvirt_start, m2p_mfn0, content, j) )
+                continue;
+
+            /*
+             * Basically, we lookup the PFN from the snapshotted M2P and we
+             * pick up the new MFN from the P2M (since we updated it "live"
+             * during the re-allocation phase above).
+             */
+            mfn = (pte >> PAGE_SHIFT) & MFN_MASK_X86;
+            pfn = orig_m2p[mfn];
+
+            if ( debug )
+                DBGPRINTF("Table[PTE]: 0x%lx[%d] ==> orig_m2p[0x%lx]=0x%lx, "
+                          "p2m[0x%lx]=0x%lx // pte: 0x%lx --> 0x%lx",
+                          table_mfn, j, mfn, pfn, pfn, minfo.p2m_table[pfn],
+                          pte, (uint64_t)((pte & ~MADDR_MASK_X86)|
+                                          (minfo.p2m_table[pfn]<<PAGE_SHIFT)|
+                                          _PAGE_PRESENT));
+
+            mfn = minfo.p2m_table[pfn];
+            pte &= ~MADDR_MASK_X86;
+            pte |= (uint64_t)mfn << PAGE_SHIFT;
+            pte |= _PAGE_PRESENT;
+
+            ((uint64_t *)content)[j] = pte;
+
+            if ( !MFN_IS_IN_PSEUDOPHYS_MAP(mfn, max_mfn, minfo, m2p_table) )
+            {
+                ERROR("Failed updating entry %d in table at MFN 0x%lx", j, table_mfn);
+                continue; // XXX
+            }
+        }
+
+        if ( content )
+            munmap(content, PAGE_SIZE);
+    }
+
+    DBGPRINTF("Re-pinning page table MFNs");
+
+    /* Pin the table types again */
+    nr_pins = 0;
+    for ( i = 0; i < minfo.p2m_size; i++ )
+    {
+        if ( (minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        switch ( minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE;
+            break;
+        default:
+            continue;
+        }
+        pin[nr_pins].arg1.mfn = minfo.p2m_table[i];
+        nr_pins++;
+
+        if ( nr_pins == MAX_PIN_BATCH )
+        {
+            if ( xc_mmuext_op(xch, pin, nr_pins, domid) < 0 )
+            {
+                PERROR("Failed to pin a batch of %d MFNs", nr_pins);
+                goto out;
+            }
+            else
+                DBGPRINTF("Re-pinned a batch of %d MFNs", nr_pins);
+            nr_pins = 0;
+        }
+    }
+    if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, domid) < 0) )
+    {
+        PERROR("Failed to pin batch of %d page tables", nr_pins);
+        goto out;
+    }
+    else
+        DBGPRINTF("Re-pinned a batch of %d MFNs", nr_pins);
+
+    /*
+     * Now, take care of the vCPU contexts. It all happens as above:
+     * we use the original M2P and the new domain's P2M to update all
+     * the various references.
+     */
+    for ( i = 0; i <= info.max_vcpu_id; i++ )
+    {
+        xc_vcpuinfo_t vinfo;
+
+        DBGPRINTF("Adjusting context for VCPU%d", i);
+
+        if ( xc_vcpu_getinfo(xch, domid, i, &vinfo) )
+        {
+            PERROR("Failed getting info for VCPU%d", i);
+            goto out;
+        }
+        if ( !vinfo.online )
+        {
+            DBGPRINTF("VCPU%d seems offline", i);
+            continue;
+        }
+
+        if ( xc_vcpu_getcontext(xch, domid, i, &ctxt) )
+        {
+            PERROR("No context for VCPU%d", i);
+            goto out;
+        }
+
+        if ( i == 0 )
+        {
+            //start_info_any_t *start_info;
+
+            /*
+             * Update the start info frame number. It is the 3rd argument
+             * to the HYPERVISOR_sched_op hypercall when op is
+             * SCHEDOP_shutdown and reason is SHUTDOWN_suspend, so we find
+             * it in EDX.
+             */
+            mfn = GET_FIELD(&ctxt, user_regs.edx);
+            mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+            SET_FIELD(&ctxt, user_regs.edx, mfn);
+
+            /*
+             * XXX: I checked, and store_mfn and console_mfn seemed ok, at
+             *      least from a 'mapping' point of view, but more testing
+             *      is needed.
+            start_info = xc_map_foreign_range(xch, domid, PAGE_SIZE, PROT_READ | PROT_WRITE, mfn);
+            munmap(start_info, PAGE_SIZE);
+            */
+        }
+
+        /* GDT pointing MFNs */
+        for ( j = 0; (512*j) < GET_FIELD(&ctxt, gdt_ents); j++ )
+        {
+            mfn = GET_FIELD(&ctxt, gdt_frames[j]);
+            mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+            SET_FIELD(&ctxt, gdt_frames[j], mfn);
+        }
+
+        /* CR3 XXX: PAE needs special attention here, I think */
+        mfn = UNFOLD_CR3(GET_FIELD(&ctxt, ctrlreg[3]));
+        mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+        SET_FIELD(&ctxt, ctrlreg[3], FOLD_CR3(mfn));
+
+        /* Guest pagetable (x86/64) in CR1 */
+        if ( (minfo.pt_levels == 4) && ctxt.x64.ctrlreg[1] )
+        {
+            /*
+             * XXX: save-restore code mangles with the least-significant
+             *      bit ('valid PFN'). This should not be needed in here.
+             */
+            mfn = UNFOLD_CR3(ctxt.x64.ctrlreg[1]);
+            mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+            ctxt.x64.ctrlreg[1] = FOLD_CR3(mfn);
+        }
+
+        /*
+         * XXX: Xen refuses to set a new context for an existing vCPU if
+         *      things like CR3 or the GDTs have changed, even if the domain
+         *      is suspended. Going through re-initializing the vCPU (by
+         *      the one call below with a NULL ctxt) makes it possible,
+         *      but is that sensible? And even if yes, is the following
+         *      _setcontext call enough?
+         */
+        if ( xc_vcpu_setcontext(xch, domid, i, NULL) )
+        {
+            PERROR("Failed re-initialising VCPU%d", i);
+            goto out;
+        }
+        if ( xc_vcpu_setcontext(xch, domid, i, &ctxt) )
+        {
+            PERROR("Failed when updating context for VCPU%d", i);
+            goto out;
+        }
+    }
+
+    /*
+     * Finally (and this time for real), we take care of the pages mapping
+     * the P2M, and of the P2M entries themselves.
+     */
+
+    live_shinfo = xc_map_foreign_range(xch, domid,
+        PAGE_SIZE, PROT_READ|PROT_WRITE, info.shared_info_frame);
+    if ( !live_shinfo )
+    {
+        PERROR("Failed mapping live_shinfo");
+        goto out;
+    }
+
+    fll = GET_FIELD(live_shinfo, arch.pfn_to_mfn_frame_list_list);
+    fll = minfo.p2m_table[mfn_to_pfn(fll, orig_m2p)];
+    live_p2m_frame_list_list = xc_map_foreign_range(xch, domid, PAGE_SIZE,
+                                                    PROT_READ|PROT_WRITE, fll);
+    if ( !live_p2m_frame_list_list )
+    {
+        PERROR("Couldn't map live_p2m_frame_list_list");
+        goto out;
+    }
+    SET_FIELD(live_shinfo, arch.pfn_to_mfn_frame_list_list, fll);
+
+    /* First, update the frames containing the list of the P2M frames */
+    for ( i = 0; i < P2M_FLL_ENTRIES; i++ )
+    {
+        mfn = ((uint64_t *)live_p2m_frame_list_list)[i];
+        mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+        ((uint64_t *)live_p2m_frame_list_list)[i] = mfn;
+    }
+
+    live_p2m_frame_list =
+        xc_map_foreign_pages(xch, domid, PROT_READ|PROT_WRITE,
+                             live_p2m_frame_list_list,
+                             P2M_FLL_ENTRIES);
+    if ( !live_p2m_frame_list )
+    {
+        PERROR("Couldn't map live_p2m_frame_list");
+        goto out;
+    }
+
+    /* And then update the actual entries of it */
+    for ( i = 0; i < P2M_FL_ENTRIES; i++ )
+    {
+        mfn = ((uint64_t *)live_p2m_frame_list)[i];
+        mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+        ((uint64_t *)live_p2m_frame_list)[i] = mfn;
+    }
+
+    rc = 0;
+
+ out:
+    if ( live_p2m_frame_list_list )
+        munmap(live_p2m_frame_list_list, PAGE_SIZE);
+    if ( live_p2m_frame_list )
+        munmap(live_p2m_frame_list, P2M_FLL_ENTRIES * PAGE_SIZE);
+    if ( live_shinfo )
+        munmap(live_shinfo, PAGE_SIZE);
+
+    free(mmu);
+    free(new_mfns);
+    free(old_mfns);
+    free(batch_pfns);
+    free(backup);
+    free(orig_m2p);
+
+    /*
+    if (gnttab_v1)
+        munmap(gnttab_v1, gnt_num / (PAGE_SIZE/sizeof(grant_entry_v1_t)));
+    if (gnttab_v2)
+        munmap(gnttab_v2, gnt_num / (PAGE_SIZE/sizeof(grant_entry_v2_t)));
+    */
+
+    xc_unmap_domain_meminfo(xch, &minfo);
+    munmap(m2p_table, M2P_SIZE(max_mfn));
+
+    return !!rc;
+}
diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
--- a/tools/libxc/xenguest.h
+++ b/tools/libxc/xenguest.h
@@ -272,6 +272,15 @@ int xc_query_page_offline_status(xc_inte
 int xc_exchange_page(xc_interface *xch, int domid, xen_pfn_t mfn);
 
+/**
+ * This function deallocates all the guest's memory and allocates it
+ * again immediately, with the net effect of moving it somewhere
+ * else wrt where it is when the function is invoked.
+ *
+ * @param xch a handle to an open hypervisor interface.
+ * @param domid the domain id one wants to move the memory of.
+ */
+int xc_domain_move_memory(xc_interface *xch, uint32_t domid/*, int hvm*/);
 
 /**
  * Memory related information, such as PFN types, the P2M table,
diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h
--- a/tools/libxc/xg_private.h
+++ b/tools/libxc/xg_private.h
@@ -145,6 +145,11 @@ static inline xen_pfn_t pfn_to_mfn(xen_p
             (((uint32_t *)p2m)[(pfn)]))));
 }
 
+static inline xen_pfn_t mfn_to_pfn(xen_pfn_t mfn, xen_pfn_t *m2p)
+{
+    return m2p[mfn];
+}
+
 /* Number of xen_pfn_t in a page */
 #define FPP (PAGE_SIZE/(dinfo->guest_width))
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 7 of 8 [RFC]] libxl: introduce libxl_domain_move_memory
and make use of it in `xl node-affinity' (introduced by one of the previous patches). This way, users of both the library and the command line can ask for deallocation and reallocation of all the memory pages belonging to a given domain.

In `xl node-affinity', if the '-M' option is used, the above happens right after changing the NUMA node affinity of the domain itself. This moves the domain from the (set of) NUMA node(s) where it currently resides to somewhere else.

In libxl, what was needed was a way of requesting domain suspension without triggering the whole save/send machinery, and avoiding its (potential) asynchronicity. Perhaps, in future, and if we want that, the _whole_ suspend+move operation can be made asynchronous, but not the single pieces of it. That is achieved by calling the core of the suspend routine directly, with a simplified save context. Of course, this brings no change to the actual suspend/resume and save/restore behaviour, wrt the current one.

XXX: Update man pages and documentation still to be done.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -853,6 +853,81 @@ int libxl_domain_unpause(libxl_ctx *ctx,
     return rc;
 }
 
+/*
+ * This function will suspend the domain and invoke xc_domain_move_memory().
+ * The xc_ call will deallocate and reallocate all the memory of the domain.
+ * Notice that this function _does_not_ resume the domain on its own, and
+ * that needs to be done manually by the caller.
+ *
+ * This means that, if the allocation policy (e.g., the node affinity)
+ * for the domain changed (right) before calling this function, at the
+ * end of it all the domain's memory will be compliant with that policy.
+ */
+int libxl_move_memory(libxl_ctx *ctx, uint32_t domid)
+{
+    GC_INIT(ctx);
+    libxl__domain_suspend_state *dss;
+    int rc = 0;
+
+    libxl_domain_type type = libxl__domain_type(gc, domid);
+    if (type == LIBXL_DOMAIN_TYPE_INVALID)
+        abort();
+
+    /*
+     * First of all, we need to suspend the domain. We use the core
+     * suspend code from libxl_dom.c, calling it directly instead of
+     * having libxc calling back into it. For that reason, we need a dss,
+     * although only some of the fields are relevant in our case.
+     */
+    GCNEW(dss);
+    dss->domid = domid;
+    dss->hvm = type == LIBXL_DOMAIN_TYPE_HVM;
+    dss->suspend_eventchn = -1;
+    dss->guest_responded = 0;
+    dss->dm_savefile = NULL;
+
+    /* Try to initialize the suspend event channel */
+    dss->xce = xc_evtchn_open(NULL, 0);
+    if (dss->xce == NULL) {
+        rc = ERROR_FAIL;
+        goto out;
+    } else {
+        int port = xs_suspend_evtchn_port(domid);
+
+        if (port >= 0) {
+            dss->suspend_eventchn =
+                xc_suspend_evtchn_init(ctx->xch, dss->xce, domid, port);
+
+            if (dss->suspend_eventchn < 0)
+                LOG(WARN, "Suspend event channel initialization failed");
+        }
+    }
+
+    if (!libxl__do_domain_suspend(gc, dss))
+    {
+        rc = ERROR_GUEST_TIMEDOUT;
+        goto out;
+    }
+
+    rc = xc_domain_move_memory(ctx->xch, domid);
+    if (rc) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "moving memory");
+        rc = ERROR_BADFAIL;
+        goto out;
+    }
+
+ out:
+    /* Tear down the suspend event channel (if successfully initialized) */
+    if (dss->suspend_eventchn > 0)
+        xc_suspend_evtchn_release(CTX->xch, dss->xce, domid,
+                                  dss->suspend_eventchn);
+    if (dss->xce != NULL)
+        xc_evtchn_close(dss->xce);
+
+    GC_FREE;
+    return rc;
+}
+
 int libxl__domain_pvcontrol_available(libxl__gc *gc, uint32_t domid)
 {
     libxl_ctx *ctx = libxl__gc_owner(gc);
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -950,6 +950,8 @@ int libxl_flask_getenforce(libxl_ctx *ct
 int libxl_flask_setenforce(libxl_ctx *ctx, int mode);
 int libxl_flask_loadpolicy(libxl_ctx *ctx,
void *policy, uint32_t size); +int libxl_move_memory(libxl_ctx *ctx, uint32_t domid); + /* misc */ /* Each of these sets or clears the flag according to whether the diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -991,11 +991,8 @@ int libxl__domain_resume_device_model(li return 0; } -int libxl__domain_suspend_common_callback(void *user) +int libxl__do_domain_suspend(libxl__gc *gc, libxl__domain_suspend_state *dss) { - libxl__save_helper_state *shs = user; - libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); - STATE_AO_GC(dss->ao); unsigned long hvm_s_state = 0, hvm_pvdrv = 0; int ret; char *state = "suspend"; @@ -1125,6 +1122,16 @@ int libxl__domain_suspend_common_callbac return 1; } + +int libxl__domain_suspend_common_callback(void *user) +{ + libxl__save_helper_state *shs = user; + libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); + STATE_AO_GC(dss->ao); + + return libxl__do_domain_suspend(gc, dss); +} + static inline char *physmap_path(libxl__gc *gc, uint32_t domid, char *phys_offset, char *node) { diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2541,6 +2541,9 @@ struct libxl__domain_create_state { void libxl__xc_domain_saverestore_async_callback_done(libxl__egc *egc, libxl__save_helper_state *shs, int return_value); +_hidden int libxl__do_domain_suspend(libxl__gc *gc, + libxl__domain_suspend_state *dss); + _hidden int libxl__domain_suspend_common_callback(void *data); _hidden void libxl__domain_suspend_common_switch_qemu_logdirty (int domid, unsigned int enable, void *data); diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -4654,13 +4654,31 @@ static void nodeaffinity(uint32_t domid, int main_nodeaffinity(int argc, char **argv) { - int opt; - - SWITCH_FOREACH_OPT(opt, "", NULL, 
"node-affinity", 2) {
-        /* No options */
+    int movemem = 0;
+    int opt = 0;
+
+    SWITCH_FOREACH_OPT(opt, "M", NULL, "node-affinity", 2) {
+    case 'M':
+        movemem = 1;
+        break;
+    }
+
+    if (argc - optind > 3) {
+        help("nodeaffinity");
+        return 2;
     }
 
     nodeaffinity(find_domain(argv[optind]), argv[optind+1]);
+
+    if (movemem) {
+        int rc = libxl_move_memory(ctx, find_domain(argv[optind]));
+
+        if (rc < 0)
+            fprintf(stderr, "Failed to move memory, trying to resume\n");
+
+        libxl_domain_resume(ctx, find_domain(argv[optind]), 1, NULL);
+    }
+
     return 0;
 }
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -217,7 +217,10 @@ struct cmd_spec cmd_table[] = {
     { "node-affinity", &main_nodeaffinity, 0, 1,
       "Set the NUMA node affinity for the domain",
-      "<Domain> [<NODEs|all|none>]",
+      "[options] <Domain> [<NODEs|all|none>]",
+      "-M  Move the memory to the nodes corresponding to CPUs\n"
+      "    (involves suspending/resuming the domain, so some\n"
+      "    downtime is to be expected)",
     },
     { "vcpu-set", &main_vcpuset, 0, 1,
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 8 of 8 [RFC]] tools/misc: introduce xen-mfndump
It's a little tool, useful when trying to figure out what goes on in both the host's and the guests' memory, i.e., stuff like MFN to PFN mappings, MFN/PFN mappings in a guest's PTEs, etc. This is what it does as of now:

 $ /usr/sbin/xen-mfndump
 Usage: xen-mfndump <command> [args]
 Commands:
     help                     show this help
     dump-m2p                 show M2P
     dump-p2m <domid>         show P2M of <domid>
     dump-ptes <domid> <mfn>  show the PTEs in <mfn>
     lookup-pte <domid> <mfn> find the PTE mapping <mfn>
     memcmp-mfns <domid1> <mfn1> <domid2> <mfn2>
                              (str)compare content of <mfn1> & <mfn2>

It's probably far from perfect, but it proves quite useful when debugging the kind of issues introduced by this series.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/misc/Makefile b/tools/misc/Makefile
--- a/tools/misc/Makefile
+++ b/tools/misc/Makefile
@@ -10,7 +10,7 @@ CFLAGS += $(CFLAGS_libxenstore)
 HDRS = $(wildcard *.h)
 
 TARGETS-y := xenperf xenpm xen-tmem-list-parse gtraceview gtracestat xenlockprof xenwatchdogd xencov
-TARGETS-$(CONFIG_X86) += xen-detect xen-hvmctx xen-hvmcrash xen-lowmemd
+TARGETS-$(CONFIG_X86) += xen-detect xen-hvmctx xen-hvmcrash xen-lowmemd xen-mfndump
 TARGETS-$(CONFIG_MIGRATE) += xen-hptool
 TARGETS := $(TARGETS-y)
@@ -24,7 +24,7 @@ INSTALL_BIN := $(INSTALL_BIN-y)
 INSTALL_SBIN-y := xm xen-bugtool xen-python-path xend xenperf xsview xenpm xen-tmem-list-parse gtraceview \
 	gtracestat xenlockprof xenwatchdogd xen-ringwatch xencov
-INSTALL_SBIN-$(CONFIG_X86) += xen-hvmctx xen-hvmcrash xen-lowmemd
+INSTALL_SBIN-$(CONFIG_X86) += xen-hvmctx xen-hvmcrash xen-lowmemd xen-mfndump
 INSTALL_SBIN-$(CONFIG_MIGRATE) += xen-hptool
 INSTALL_SBIN := $(INSTALL_SBIN-y)
@@ -77,6 +77,9 @@ xenlockprof: xenlockprof.o
 xen-hptool: xen-hptool.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS)
 
+xen-mfndump: xen-mfndump.o
+	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS)
+
xenwatchdogd: xenwatchdogd.o $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS) diff --git a/tools/misc/xen-mfndump.c b/tools/misc/xen-mfndump.c new file mode 100644 --- /dev/null +++ b/tools/misc/xen-mfndump.c @@ -0,0 +1,425 @@ +#include <xenctrl.h> +#include <xc_private.h> +#include <xc_core.h> +#include <errno.h> +#include <unistd.h> + +#include "xg_save_restore.h" + +#define ARRAY_SIZE(a) (sizeof (a) / sizeof ((a)[0])) + +static xc_interface *xch; + +int help_func(int argc, char *argv[]) +{ + fprintf(stderr, + "Usage: xen-mfndump <command> [args]\n" + "Commands:\n" + " help show this help\n" + " dump-m2p show M2P\n" + " dump-p2m <domid> show P2M of <domid>\n" + " dump-ptes <domid> <mfn> show the PTEs in <mfn>\n" + " lookup-pte <domid> <mfn> find the PTE mapping <mfn>\n" + " memcmp-mfns <domid1> <mfn1> <domid2> <mfn2>\n" + " compare content of <mfn1> & <mfn2>\n" + ); + + return 0; +} + +int dump_m2p_func(int argc, char *argv[]) +{ + unsigned long i, max_mfn; + xen_pfn_t *m2p_table; + + if ( argc > 0 ) + { + help_func(0, NULL); + return 1; + } + + /* Map M2P and obtain gpfn */ + max_mfn = xc_maximum_ram_page(xch); + if ( !(m2p_table = xc_map_m2p(xch, max_mfn, PROT_READ, NULL)) ) + { + ERROR("Failed to map live M2P table"); + return -1; + } + + printf(" --- Dumping M2P ---\n"); + printf(" Max MFN: %lu\n", max_mfn); + for ( i = 0; i < max_mfn; i++ ) + { + printf(" mfn=0x%lx ==> pfn=0x%lx\n", i, m2p_table[i]); + } + printf(" --- End of M2P ---\n"); + + munmap(m2p_table, M2P_SIZE(max_mfn)); + return 0; +} + +int dump_p2m_func(int argc, char *argv[]) +{ + struct xc_domain_meminfo minfo; + xc_dominfo_t info; + unsigned long i; + int domid; + + if ( argc < 1 ) + { + help_func(0, NULL); + return 1; + } + domid = atoi(argv[0]); + + if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 || + info.domid != domid ) + { + ERROR("Failed to obtain info for domain %d\n", domid); + return -1; + } + + /* Retrieve all the info about the domain''s memory */ + memset(&minfo, 
0, sizeof(minfo)); + if ( xc_map_domain_meminfo(xch, domid, &minfo) ) + { + ERROR("Could not map domain %d memory information\n", domid); + return -1; + } + + printf(" --- Dumping P2M for domain %d ---\n", domid); + printf(" Guest Width: %u, PT Levels: %u P2M size: = %lu\n", + minfo.guest_width, minfo.pt_levels, minfo.p2m_size); + for ( i = 0; i < minfo.p2m_size; i++ ) + { + unsigned long pagetype = minfo.pfn_type[i] & + XEN_DOMCTL_PFINFO_LTAB_MASK; + + printf(" pfn=0x%lx ==> mfn=0x%lx (type 0x%lx)", i, minfo.p2m_table[i], + pagetype >> XEN_DOMCTL_PFINFO_LTAB_SHIFT); + + if ( is_mapped(minfo.p2m_table[i]) ) + printf(" [mapped]"); + + if ( pagetype & XEN_DOMCTL_PFINFO_LPINTAB ) + printf (" [pinned]"); + + if ( pagetype == XEN_DOMCTL_PFINFO_XTAB ) + printf(" [xtab]"); + if ( pagetype == XEN_DOMCTL_PFINFO_BROKEN ) + printf(" [broken]"); + if ( pagetype == XEN_DOMCTL_PFINFO_XALLOC ) + printf( " [xalloc]"); + + switch ( pagetype & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) + { + case XEN_DOMCTL_PFINFO_L1TAB: + printf(" L1 table"); + break; + + case XEN_DOMCTL_PFINFO_L2TAB: + printf(" L2 table"); + break; + + case XEN_DOMCTL_PFINFO_L3TAB: + printf(" L3 table"); + break; + + case XEN_DOMCTL_PFINFO_L4TAB: + printf(" L4 table"); + break; + } + + printf("\n"); + } + printf(" --- End of P2M for domain %d ---\n", domid); + + xc_unmap_domain_meminfo(xch, &minfo); + return 0; +} + +int dump_ptes_func(int argc, char *argv[]) +{ + struct xc_domain_meminfo minfo; + xc_dominfo_t info; + void *page = NULL; + unsigned long i, max_mfn; + int domid, pte_num, rc = 0; + xen_pfn_t pfn, mfn, *m2p_table; + + if ( argc < 2 ) + { + help_func(0, NULL); + return 1; + } + domid = atoi(argv[0]); + mfn = strtoul(argv[1], NULL, 16); + + if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 || + info.domid != domid ) + { + ERROR("Failed to obtain info for domain %d\n", domid); + return -1; + } + + /* Retrieve all the info about the domain''s memory */ + memset(&minfo, 0, sizeof(minfo)); + if ( 
xc_map_domain_meminfo(xch, domid, &minfo) ) + { + ERROR("Could not map domain %d memory information\n", domid); + return -1; + } + + /* Map M2P and obtain gpfn */ + max_mfn = xc_maximum_ram_page(xch); + if ( (mfn > max_mfn) || + !(m2p_table = xc_map_m2p(xch, max_mfn, PROT_READ, NULL)) ) + { + xc_unmap_domain_meminfo(xch, &minfo); + ERROR("Failed to map live M2P table"); + return -1; + } + + pfn = m2p_table[mfn]; + if ( pfn >= minfo.p2m_size ) + { + ERROR("pfn 0x%lx out of range for domain %d\n", pfn, domid); + rc = -1; + goto out; + } + + if ( !(minfo.pfn_type[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK) ) + { + ERROR("pfn 0x%lx for domain %d is not a PT\n", pfn, domid); + rc = -1; + goto out; + } + + page = xc_map_foreign_range(xch, domid, PAGE_SIZE, PROT_READ, + minfo.p2m_table[pfn]); + if ( !page ) + { + ERROR("Failed to map 0x%lx\n", minfo.p2m_table[pfn]); + rc = -1; + goto out; + } + + pte_num = PAGE_SIZE / 8; + + printf(" --- Dumping %d PTEs for domain %d ---\n", pte_num, domid); + printf(" Guest Width: %u, PT Levels: %u P2M size: = %lu\n", + minfo.guest_width, minfo.pt_levels, minfo.p2m_size); + printf(" pfn: 0x%lx, mfn: 0x%lx", + pfn, minfo.p2m_table[pfn]); + switch ( minfo.pfn_type[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) + { + case XEN_DOMCTL_PFINFO_L1TAB: + printf(", L1 table"); + break; + case XEN_DOMCTL_PFINFO_L2TAB: + printf(", L2 table"); + break; + case XEN_DOMCTL_PFINFO_L3TAB: + printf(", L3 table"); + break; + case XEN_DOMCTL_PFINFO_L4TAB: + printf(", L4 table"); + break; + } + if ( minfo.pfn_type[pfn] & XEN_DOMCTL_PFINFO_LPINTAB ) + printf (" [pinned]"); + if ( is_mapped(minfo.p2m_table[pfn]) ) + printf(" [mapped]"); + printf("\n"); + + for ( i = 0; i < pte_num; i++ ) + printf(" pte[%lu]: 0x%lx\n", i, ((const uint64_t*)page)[i]); + + printf(" --- End of PTEs for domain %d, pfn=0x%lx (mfn=0x%lx) ---\n", + domid, pfn, minfo.p2m_table[pfn]); + + out: + munmap(page, PAGE_SIZE); + xc_unmap_domain_meminfo(xch, &minfo); + munmap(m2p_table, 
M2P_SIZE(max_mfn));
+    return rc;
+}
+
+int lookup_pte_func(int argc, char *argv[])
+{
+    struct xc_domain_meminfo minfo;
+    xc_dominfo_t info;
+    void *page = NULL;
+    unsigned long i, j;
+    int domid, pte_num;
+    xen_pfn_t mfn;
+
+    if ( argc < 2 )
+    {
+        help_func(0, NULL);
+        return 1;
+    }
+    domid = atoi(argv[0]);
+    mfn = strtoul(argv[1], NULL, 16);
+
+    if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
+         info.domid != domid )
+    {
+        ERROR("Failed to obtain info for domain %d\n", domid);
+        return -1;
+    }
+
+    /* Retrieve all the info about the domain's memory */
+    memset(&minfo, 0, sizeof(minfo));
+    if ( xc_map_domain_meminfo(xch, domid, &minfo) )
+    {
+        ERROR("Could not map domain %d memory information\n", domid);
+        return -1;
+    }
+
+    pte_num = PAGE_SIZE / 8;
+
+    printf(" --- Looking for PTEs mapping mfn 0x%lx for domain %d ---\n",
+           mfn, domid);
+    printf(" Guest Width: %u, PT Levels: %u P2M size: = %lu\n",
+           minfo.guest_width, minfo.pt_levels, minfo.p2m_size);
+
+    for ( i = 0; i < minfo.p2m_size; i++ )
+    {
+        if ( !(minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK) )
+            continue;
+
+        page = xc_map_foreign_range(xch, domid, PAGE_SIZE, PROT_READ,
+                                    minfo.p2m_table[i]);
+        if ( !page )
+            continue;
+
+        for ( j = 0; j < pte_num; j++ )
+        {
+            uint64_t pte = ((const uint64_t*)page)[j];
+
+#define __MADDR_BITS_X86 ((minfo.guest_width == 8) ?
52 : 44)
+#define __MFN_MASK_X86 ((1ULL << (__MADDR_BITS_X86 - PAGE_SHIFT_X86)) - 1)
+            if ( ((pte >> PAGE_SHIFT_X86) & __MFN_MASK_X86) == mfn)
+                printf(" 0x%lx <-- [0x%lx][%lu]: 0x%lx\n",
+                       mfn, minfo.p2m_table[i], j, pte);
+#undef __MADDR_BITS_X86
+#undef __MFN_MASK_X86
+        }
+
+        munmap(page, PAGE_SIZE);
+        page = NULL;
+    }
+
+    xc_unmap_domain_meminfo(xch, &minfo);
+
+    return 1;
+}
+
+int memcmp_mfns_func(int argc, char *argv[])
+{
+    xc_dominfo_t info1, info2;
+    void *page1 = NULL, *page2 = NULL;
+    int domid1, domid2;
+    xen_pfn_t mfn1, mfn2;
+    int rc = 0;
+
+    if ( argc < 4 )
+    {
+        help_func(0, NULL);
+        return 1;
+    }
+    domid1 = atoi(argv[0]);
+    domid2 = atoi(argv[2]);
+    mfn1 = strtoul(argv[1], NULL, 16);
+    mfn2 = strtoul(argv[3], NULL, 16);
+
+    if ( xc_domain_getinfo(xch, domid1, 1, &info1) != 1 ||
+         xc_domain_getinfo(xch, domid2, 1, &info2) != 1 ||
+         info1.domid != domid1 || info2.domid != domid2)
+    {
+        ERROR("Failed to obtain info for domains\n");
+        return -1;
+    }
+
+    page1 = xc_map_foreign_range(xch, domid1, PAGE_SIZE, PROT_READ, mfn1);
+    page2 = xc_map_foreign_range(xch, domid2, PAGE_SIZE, PROT_READ, mfn2);
+    if ( !page1 || !page2 )
+    {
+        ERROR("Failed to map either 0x%lx[dom %d] or 0x%lx[dom %d]\n",
+              mfn1, domid1, mfn2, domid2);
+        rc = -1;
+        goto out;
+    }
+
+    printf(" --- Comparing the content of 2 MFNs ---\n");
+    printf(" 1: 0x%lx[dom %d], 2: 0x%lx[dom %d]\n",
+           mfn1, domid1, mfn2, domid2);
+    printf(" memcmp(1, 2) = %d\n", memcmp(page1, page2, PAGE_SIZE));
+
+ out:
+    munmap(page1, PAGE_SIZE);
+    munmap(page2, PAGE_SIZE);
+    return rc;
+}
+
+
+
+struct {
+    const char *name;
+    int (*func)(int argc, char *argv[]);
+} opts[] = {
+    { "help", help_func },
+    { "dump-m2p", dump_m2p_func },
+    { "dump-p2m", dump_p2m_func },
+    { "dump-ptes", dump_ptes_func },
+    { "lookup-pte", lookup_pte_func },
+    { "memcmp-mfns", memcmp_mfns_func},
+};
+
+int main(int argc, char *argv[])
+{
+    int i, ret;
+
+    if (argc < 2)
+    {
+        help_func(0, NULL);
+        return 1;
+    }
+
+    xch =
xc_interface_open(0, 0, 0);
+    if ( !xch )
+    {
+        ERROR("Failed to open an xc handle");
+        return 1;
+    }
+
+    for ( i = 0; i < ARRAY_SIZE(opts); i++ )
+    {
+        if ( !strncmp(opts[i].name, argv[1], strlen(argv[1])) )
+            break;
+    }
+
+    if ( i == ARRAY_SIZE(opts) )
+    {
+        fprintf(stderr, "Unknown option '%s'", argv[1]);
+        help_func(0, NULL);
+        return 1;
+    }
+
+    ret = opts[i].func(argc - 2, argv + 2);
+
+    xc_interface_close(xch);
+
+    return !!ret;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-set-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
Juergen Gross
2013-Apr-09 05:23 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On 09.04.2013 04:49, Dario Faggioli wrote:
> as a mechanism of deallocating and reallocating (immediately!) _all_
> the memory of a domain. Notice it relies on the guest being suspended
> already, before the function is invoked.

Is this solution intended to be the final one?

This might be okay for a domain with less than 1GB of memory, but I see problems for really huge domains. The needed time to copy the memory might result in long offline times. For this case something like live migration (optional?) would be a better solution, I think.

Juergen

--
Juergen Gross
Principal Developer Operating Systems
PBG PDG ES&S SWE OS6
Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions
e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28
Internet: ts.fujitsu.com
D-80807 Muenchen
Company details: ts.fujitsu.com/imprint.html
Dario Faggioli
2013-Apr-09 06:56 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On mar, 2013-04-09 at 06:23 +0100, Juergen Gross wrote:
> On 09.04.2013 04:49, Dario Faggioli wrote:
> > as a mechanism of deallocating and reallocating (immediately!) _all_
> > the memory of a domain. Notice it relies on the guest being suspended
> > already, before the function is invoked.
>
> Is this solution intended to be the final one?
>
Well, the idea of sharing the patches, even at this stage, was right about discussing that! :-P

> This might be okay for a domain with less than 1GB of memory, but I see
> problems for really huge domains. The needed time to copy the memory might
> result in long offline times.
>
I see what you mean. I thought about approaches that copy only a specific part of the memory, perhaps according to some usage statistics.

I've not yet abandoned that idea, but I honestly think that, if we go through the suspend-copy-resume (which is pretty much the only thing I can do with PV guests, isn't it?), that can't be for a page or two, or the impact of the overhead would be even higher!

> For this case something like live migration
> (optional?) would be a better solution, I think.
>
Well, I thought about that too, and I'm open to thinking/discussing/hearing suggestions about how to implement a "live phase" for this.

The problem is, with a more migration-alike approach, I'll end up doubling the memory requirements of, potentially, all the domains (since I'd need space for storing the full RAM image of each one!), which I don't think is an acceptable requirement either, _especially_ for big guests, is it? :-(

Thanks for your interest, :-)
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Juergen Gross
2013-Apr-09 08:13 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On 09.04.2013 08:56, Dario Faggioli wrote:
> On mar, 2013-04-09 at 06:23 +0100, Juergen Gross wrote:
>> On 09.04.2013 04:49, Dario Faggioli wrote:
>>> as a mechanism of deallocating and reallocating (immediately!) _all_
>>> the memory of a domain. Notice it relies on the guest being suspended
>>> already, before the function is invoked.
>>
>> Is this solution intended to be the final one?
>>
> Well, the idea of sharing the patches, even at this stage, was right
> about discussing that! :-P
>
>> This might be okay for a domain with less than 1GB of memory, but I see
>> problems for really huge domains. The needed time to copy the memory might
>> result in long offline times.
>>
> I see what you mean. I thought about approaches that copy only a
> specific part of the memory, perhaps according to some usage statistics.
>
> I've not yet abandoned that idea, but I honestly think that, if we go
> through the suspend-copy-resume (which is pretty much the only thing I
> can do with PV guests, isn't it?), that can't be for a page or two, or
> the impact of the overhead would be even higher!
>
>> For this case something like live migration
>> (optional?) would be a better solution, I think.
>>
> Well, I thought about that too, and I'm open to
> thinking/discussing/hearing suggestions about how to implement a "live
> phase" for this.
>
> The problem is, with a more migration-alike approach, I'll end up
> doubling the memory requirements of, potentially, all the domains (since
> I'd need space for storing the full RAM image of each one!), which I
> don't think is an acceptable requirement either, _especially_ for big
> guests, is it? :-(

What about the following approach:

- do the migration in chunks (like 1GB, may be configurable)
- don't move pages which are already on one of the target nodes
- try to allocate memory on the target node while the domain is still running.
  If this fails, there is no need to move that chunk. Depending on the page
  size requirements (huge pages) decide whether the move is aborted or done
  partially.
- in case of successful allocation suspend the domain, do the copy and update
  page tables for the copied pages, then resume the domain
- free the memory chunk on the old node(s)
- repeat until either no memory obtained or move is finished

This will have higher overhead, but the domain will be suspended for only short periods of time. The memory requirements don't matter, as the additional memory will be allocated only for a short period of time.

Additionally this approach is more secure, as the domain can't end up in a suspended state without memory (you don't have to avoid ballooning or creation of other domains during the move).

Juergen

--
Juergen Gross
Principal Developer Operating Systems
PBG PDG ES&S SWE OS6
Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions
e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28
Internet: ts.fujitsu.com
D-80807 Muenchen
Company details: ts.fujitsu.com/imprint.html
Dario Faggioli
2013-Apr-09 08:51 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On mar, 2013-04-09 at 09:13 +0100, Juergen Gross wrote:
> What about the following approach:
>
In general, I like it... More details below.

> - do the migration in chunks (like 1GB, may be configurable)
>
Yes, provided these chunks are big enough, I think the overhead is acceptable.

> - don't move pages which are already on one of the target nodes
>
Yep, that is definitely sane, and was already on my TODO list (although, you're right, I forgot to mention it in the cover or in the various changelogs). It's not there yet because I'm missing a way of knowing on what node a page is, but I'm already working on putting it together.

Anyway, I agree on this too, and thanks for pointing that out. :-)

> - try to allocate memory on the target node while the domain is still running.
>   If this fails, there is no need to move that chunk. Depending on the page
>   size requirements (huge pages) decide whether the move is aborted or done
>   partially.
> - in case of successful allocation suspend the domain, do the copy and update
>   page tables for the copied pages, then resume the domain
>
This is also fine, the only issue being that I'd probably need to fiddle with the domain max_mem, and stuff like that, wouldn't I? I'm saying this because, when testing the few patches that I sent already, I ran right into this when I was trying to do it in the allocate-copy-deallocate order (of course, depending on how big a chunk is, but this is going to be much less than 1GB!).

Do you see what I mean? Do you think it would be nice to increase the domain's "memory allowance" (temporarily, of course) for this to be possible?

> - free the memory chunk on the old node(s)
> - repeat until either no memory obtained or move is finished
>
> This will have higher overhead, but the domain will be suspended for only
> short periods of time. The memory requirements don't matter, as the additional
> memory will be allocated only for a short period of time.
>
Yep, this all makes sense, with the only nit being the max_mem issue above.

Thanks again and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Juergen Gross
2013-Apr-09 09:16 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On 09.04.2013 10:51, Dario Faggioli wrote:
> On mar, 2013-04-09 at 09:13 +0100, Juergen Gross wrote:
>> What about the following approach:
>>
> In general, I like it... More details below.
>
>> - do the migration in chunks (like 1GB, may be configurable)
>>
> Yes, provided these chunks are big enough, I think the overhead is
> acceptable.
>
>> - don't move pages which are already on one of the target nodes
>>
> Yep, that is definitely sane, and was already on my TODO list (although,
> you're right, I forgot to mention it in the cover or in the various
> changelogs). It's not there yet because I'm missing a way of knowing on
> what node a page is, but I'm already working on putting it together.
>
> Anyway, I agree on this too, and thanks for pointing that out. :-)
>
>> - try to allocate memory on the target node while the domain is still running.
>>   If this fails, there is no need to move that chunk. Depending on the page
>>   size requirements (huge pages) decide whether the move is aborted or done
>>   partially.
>> - in case of successful allocation suspend the domain, do the copy and update
>>   page tables for the copied pages, then resume the domain
>>
> This is also fine, the only issue being that I'd probably need to fiddle
> with the domain max_mem, and stuff like that, wouldn't I? I'm saying
> this because, when testing the few patches that I sent already, I ran right
> into this when I was trying to do it in the allocate-copy-deallocate order
> (of course, depending on how big a chunk is, but this is going to be
> much less than 1GB!).

There might be 1GB huge pages which have to be copied at once (especially for PV domains). Doing a migration to another node for performance reasons and losing huge-page advantages at the same time seems to be a bad idea. :-)

> Do you see what I mean? Do you think it would be nice to increase the
> domain's "memory allowance" (temporarily, of course) for this to be
> possible?

Would make sense, I think. :-)

>> - free the memory chunk on the old node(s)
>> - repeat until either no memory obtained or move is finished
>>
>> This will have higher overhead, but the domain will be suspended for only
>> short periods of time. The memory requirements don't matter, as the additional
>> memory will be allocated only for a short period of time.
>>
> Yep, this all makes sense, with the only nit being the max_mem issue
> above.

Juergen

--
Juergen Gross
Principal Developer Operating Systems
PBG PDG ES&S SWE OS6
Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions
e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28
Internet: ts.fujitsu.com
D-80807 Muenchen
Company details: ts.fujitsu.com/imprint.html
Dan Magenheimer
2013-Apr-09 17:43 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Subject: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory

(NUMA discussion...)

> > XXX Open issues are:
> >  - TMEM: how to "move" it?

(Konrad added to cc list.)

Tmem memory is, by definition, the lowest priority memory for the domain, and the hypervisor may already be storing it as efficiently as possible (i.e. the page may be deduplicated). When it is accessed by the domain (it is never directly addressable by a domain, and a hypercall is required to access it), an entire page is sequentially copied from a physical page in the hypervisor to the domain. Juergen may know otherwise, but I'd guess this inter-node copy would be very efficiently pipelined, cache-line by cache-line, possibly even with hardware pre-fetching.

So the best answer to "how to move it?" may be "don't move it at all!". In fact, a good design for a NUMA-aware implementation of tmem might intentionally store the data on "any node other than the node making this tmem-put hypercall".

Dan
Dario Faggioli
2013-Apr-11 14:16 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On mar, 2013-04-09 at 18:43 +0100, Dan Magenheimer wrote:
> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> > Subject: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
>
> (NUMA discussion...)
>
Hi Dan,

> > > XXX Open issues are:
> > >  - TMEM: how to "move" it?
>
> (Konrad added to cc list.)
>
> Tmem memory is, by definition, the lowest priority memory
> for the domain and the hypervisor may already be storing it as
> efficiently as possible (i.e. the page may be deduplicated).
> When it is accessed by the domain (it is never directly
> addressable by a domain, and a hypercall is required
> to access it), an entire page is sequentially copied from
> a physical page in the hypervisor to the domain. Juergen may
> know otherwise, but I'd guess this inter-node copy would be
> very efficiently pipelined, cache-line by cache-line,
> possibly even with hardware pre-fetching.
>
Ok, thanks for the clarification.

> So the best answer to "how to move it?" may be "don't
> move it at all!".
>
Ok. I sort of got the feeling that "not touching" would have been TRT but, again, thanks for making it clear. :-)

> In fact, a good design for a NUMA-aware
> implementation of tmem might intentionally store the data on
> "any node other than the node making this tmem-put hypercall".
>
Well, we'll get there too, sooner or later. For now, and for the purpose of this specific work, I'll put things in such a way that they leave TMEM alone.

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Tim Deegan
2013-May-02 14:32 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
Hi,

This looks like a promising start. Two thoughts:

1. You currently move memory into a buffer, free it, allocate new memory and restore the contents. Copying directly from old to new would be significantly faster, and you could do it for _most_ batches:
   - copy old batch 0 to the backup buffer; free old batch 0;
   - allocate new batch 1; copy batch 1 directly; free old batch 1;
   ...
   - allocate new batch n; copy batch n directly; free old batch n;
   - allocate new batch 0; copy batch 0 from the backup buffer.

2. Clearing all the _PAGE_PRESENT bits with mmu-update hypercalls must be overkill. It ought to be possible to drop those pages' typecounts to 0 by unpinning them and then resetting all the vcpus. Then you should be able to just update the contents with normal writes and re-pin afterwards.

Cheers,

Tim.

At 04:49 +0200 on 09 Apr (1365482951), Dario Faggioli wrote:
> as a mechanism of deallocating and reallocating (immediately!) _all_
> the memory of a domain. Notice it relies on the guest being suspended
> already, before the function is invoked.
>
> Of course, it is quite likely that the memory ends up in different
> places from where it was before calling it but, for instance, the fact
> that this is actually a different NUMA node (or anything else) does not
> depend by any means on this function.
>
> In fact, here the guest pages are just freed and immediately
> re-allocated (you can see it as a very quick, back-to-back save-restore
> cycle).
>
> If the current domain configuration says, for instance, that new
> allocations should go to a specific NUMA node, then the whole domain
> is, as a matter of fact, moved there, but again, this is not
> something this function does explicitly.
>
> The way we do this is, very briefly, as follows:
>  1. drop all the references to all the pages of a domain,
>  2. backup the content of a batch of pages,
>  3. deallocate the batch,
>  4. allocate a new set of pages for the batch,
>  5.
copy the backed up content in the new pages,
>  6. if there are more pages, go back to 2, otherwise
>  7. update the page tables, the vcpu contexts, the P2M, etc.
>
> The above raises a number of quite complex issues and _not_all_
> of them are being dealt with or solved in this series (RFC means
> something after all, doesn't it? ;-P).
>
> XXX Open issues are:
>  - HVM ("easy" to add, but it's not in this patch. See the
>    cover letter for the series);
>  - PAE guests, as they need special attention for some of
>    the page tables (should be trivial to add);
>  - grant tables/granted pages: how to move them?
>  - TMEM: how to "move" it?
>  - shared/paged pages: what to do with them?
>  - guest pages mapped in Xen, for instance:
>    * vcpu info pages: moved but, how to update the mapping?
>    * EOI page: moved but, how to update the mapping?
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
> --- a/tools/libxc/Makefile
> +++ b/tools/libxc/Makefile
> @@ -48,6 +48,11 @@ else
>  GUEST_SRCS-y += xc_nomigrate.c
>  endif
>
> +# XXX: Well, for sure there are some X86-isms in the current code.
> +#      Making it more ARM friendly should not be a big deal though,
> +#      will do for next release.
> +GUEST_SRCS-$(CONFIG_X86) += xc_domain_movemem.c
> +
>  vpath %.c ../../xen/common/libelf
>  CFLAGS += -I../../xen/common/libelf
>
> diff --git a/tools/libxc/xc_domain_movemem.c b/tools/libxc/xc_domain_movemem.c
> new file mode 100644
> --- /dev/null
> +++ b/tools/libxc/xc_domain_movemem.c
> @@ -0,0 +1,766 @@
> +/******************************************************************************
> + * xc_domain_movemem.c
> + *
> + * Deallocate and reallocate all the memory of a domain.
> + *
> + * Copyright (c) 2013, Dario Faggioli.
> + * Copyright (c) 2012, Citrix Systems, Inc.
> + * > + * This library is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; > + * version 2.1 of the License. > + * > + * This library is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with this library; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA > + */ > + > +#include <inttypes.h> > +#include <time.h> > +#include <stdlib.h> > +#include <unistd.h> > +#include <sys/time.h> > +#include <xc_core.h> > + > +#include "xc_private.h" > +#include "xc_dom.h" > +#include "xg_private.h" > +#include "xg_save_restore.h" > + > +/* Needed from translation macros in xg_private.h */ > +static struct domain_info_context _dinfo; > +static struct domain_info_context *dinfo = &_dinfo; > + > +#define MAX_BATCH_SIZE 1024 > +#define MAX_PIN_BATCH 1024 > + > +#define MFN_IS_IN_PSEUDOPHYS_MAP(_mfn, _max_mfn, _minfo, _m2p) \ > + (((_mfn) < (_max_mfn)) && ((mfn_to_pfn(_mfn, _m2p) < (_minfo).p2m_size) && \ > + (pfn_to_mfn(mfn_to_pfn(_mfn, _m2p), (_minfo).p2m_table, \ > + (_minfo).guest_width) == (_mfn)))) > + > +/* > + * This is to determine which entries in this page table hold reserved > + * hypervisor mappings. This depends on the current page table type as > + * well as the number of paging levels (see also xc_domain_save.c). > + * > + * XXX: export this function so that it can be used both here and from > + * canonicalize_pagetable(), in xc_domain_save.c. 
> + */ > +static int is_xen_mapping(struct xc_domain_meminfo *minfo, unsigned long type, > + unsigned long hvirt_start, unsigned long m2p_mfn0, > + const void *spage, int pte) > +{ > + int xen_start, xen_end, pte_last; > + > + xen_start = xen_end = pte_last = PAGE_SIZE / 8; > + > + if ( (minfo->pt_levels == 3) && (type == XEN_DOMCTL_PFINFO_L3TAB) ) > + xen_start = L3_PAGETABLE_ENTRIES_PAE; > + > + /* > + * In PAE only the L2 mapping the top 1GB contains Xen mappings. > + * We can spot this by looking for the guest's mapping of the m2p. > + * Guests must ensure that this check will fail for other L2s. > + */ > + if ( (minfo->pt_levels == 3) && (type == XEN_DOMCTL_PFINFO_L2TAB) ) > + { > + int hstart; > + uint64_t he; > + > + hstart = (hvirt_start >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff; > + he = ((const uint64_t *) spage)[hstart]; > + > + if ( ((he >> PAGE_SHIFT) & MFN_MASK_X86) == m2p_mfn0 ) > + { > + /* hvirt starts with xen stuff... */ > + xen_start = hstart; > + } > + else if ( hvirt_start != 0xf5800000 ) > + { > + /* old L2s from before hole was shrunk... */ > + hstart = (0xf5800000 >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff; > + he = ((const uint64_t *) spage)[hstart]; > + if ( ((he >> PAGE_SHIFT) & MFN_MASK_X86) == m2p_mfn0 ) > + xen_start = hstart; > + } > + } > + > + if ( (minfo->pt_levels == 4) && (type == XEN_DOMCTL_PFINFO_L4TAB) ) > + { > + /* > + * XXX SMH: should compute these from hvirt_start (which we have) > + * and hvirt_end (which we don't) > + */ > + xen_start = 256; > + xen_end = 272; > + } > + > + return pte >= xen_start && pte < xen_end; > +} > + > +/* > + * This function will basically deallocate _all_ the memory of a domain and > + * reallocate it immediately. It relies on the guest being suspended > + * already, before the function is even invoked. 
> + * > + * Of course, it is quite likely that the memory ends up in different places > + * from where it was before calling this but, for instance, the fact that > + * this is actually a different NUMA node (or anything else) does not > + * in any way depend on this function. In fact, here the guest pages are > + * just freed and immediately re-allocated (you can see it as a very quick, > + * back-to-back domain_save--domain_restore). If the current domain > + * configuration says, for instance, that new allocations should go to a > + * different NUMA node, then the whole domain is moved there, but again, > + * this is not something this function does explicitly. > + * > + * If actually interested in doing something like that (i.e., moving the > + * domain to a different NUMA node), calling xc_domain_node_setaffinity() > + * right before this should achieve it. > + */ > +int xc_domain_move_memory(xc_interface *xch, uint32_t domid/*, int hvm*/) > +{ > + unsigned int i, j; > + int rc = 1; > + > + xc_dominfo_t info; > + struct xc_domain_meminfo minfo; > + > + struct mmuext_op pin[MAX_PIN_BATCH]; > + unsigned int nr_pins; > + > + struct xc_mmu *mmu = NULL; > + unsigned int xen_pt_levels, dom_guest_width; > + unsigned long max_mfn, hvirt_start, m2p_mfn0; > + vcpu_guest_context_any_t ctxt; > + > + void *live_p2m_frame_list_list = NULL; > + void *live_p2m_frame_list = NULL; > + > + /* > + * XXX: grant tables & granted pages need to be considered, e.g., > + * using xc_is_page_granted_vX() in xc_offline_page.c to > + * recognise them, etc. 
> + int gnt_num; > + grant_entry_v1_t *gnttab_v1 = NULL; > + grant_entry_v2_t *gnttab_v2 = NULL; > + */ > + > + void *old_p, *new_p, *backup = NULL; > + unsigned long mfn, pfn; > + uint64_t fll; > + > + xen_pfn_t *new_mfns = NULL, *old_mfns = NULL, *batch_pfns = NULL; > + int pte_num = PAGE_SIZE / 8, cleared_pte = 0; > + xen_pfn_t *m2p_table, *orig_m2p = NULL; > + shared_info_any_t *live_shinfo = NULL; > + > + unsigned long n = 0, n_skip = 0; > + > + int debug = 0; /* XXX will become a parameter */ > + > + if ( !get_platform_info(xch, domid, &max_mfn, &hvirt_start, > + &xen_pt_levels, &dom_guest_width) ) > + { > + ERROR("Failed getting platform info"); > + return 1; > + } > + > + /* We expect the domain to be suspended already */ > + if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ) > + { > + PERROR("Failed getting domain info"); > + return 1; > + } > + if ( !info.shutdown || info.shutdown_reason != SHUTDOWN_suspend) > + { > + PERROR("Domain appears not to be suspended"); > + return 1; > + } > + > + DBGPRINTF("Establishing the mappings for M2P and P2M"); > + memset(&minfo, 0, sizeof(minfo)); > + if ( !(m2p_table = xc_map_m2p(xch, max_mfn, PROT_READ, &m2p_mfn0)) ) > + { > + PERROR("Failed to map the M2P table"); > + return 1; > + } > + if ( xc_map_domain_meminfo(xch, domid, &minfo) ) > + { > + PERROR("Failed to map domain's memory information"); > + goto out; > + } > + dinfo->guest_width = minfo.guest_width; > + dinfo->p2m_size = minfo.p2m_size; > + > + /* > + * XXX > + DBGPRINTF("Mapping the grant tables"); > + gnttab_v2 = xc_gnttab_map_table_v2(xch, domid, &gnt_num); > + if (!gnttab_v2) > + { > + PERROR("Failed to map V2 grant table... Trying V1"); > + gnttab_v1 = xc_gnttab_map_table_v1(xch, domid, &gnt_num); > + if (!gnttab_v1) > + { > + PERROR("Failed to map grant table"); > + goto out; > + } > + } > + DBGPRINTF("Grant table mapped. 
%d grants found", gnt_num); > + */ > + > + mmu = xc_alloc_mmu_updates(xch, (domid+1)<<16|domid); > + if ( mmu == NULL ) > + { > + PERROR("Failed to allocate memory for MMU updates"); > + goto out; > + } > + > + /* Alloc support data structures */ > + new_mfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t)); > + old_mfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t)); > + batch_pfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t)); > + > + backup = malloc(PAGE_SIZE * MAX_BATCH_SIZE); > + > + orig_m2p = calloc(max_mfn, sizeof(xen_pfn_t)); > + > + if ( !new_mfns || !old_mfns || !batch_pfns || !backup || !orig_m2p ) > + { > + ERROR("Failed to allocate copying and/or backup data structures"); > + goto out; > + } > + > + DBGPRINTF("Saving the original M2P"); > + memcpy(orig_m2p, m2p_table, max_mfn * sizeof(xen_pfn_t)); > + > + DBGPRINTF("Starting deallocating and reallocating all memory for domain %d" > "\n\tnr_pages=%lu, nr_shared_pages=%lu, nr_paged_pages=%lu" > "\n\tnr_online_vcpus=%u, max_vcpu_id=%u", > domid, info.nr_pages, info.nr_shared_pages, info.nr_paged_pages, > info.nr_online_vcpus, info.max_vcpu_id); > + > + /* Beware: no going back from this point!! */ > + > + /* > + * As a part of the process of dropping all the references to the existing > + * pages in memory, so that we can free (and then re-allocate) them, we need > + * to unpin them. > + * > + * We do that in batches of 1024 PFNs at each step, to amortize the cost > + * of xc_mmuext_op() calls. 
> + */ > + nr_pins = 0; > + for ( i = 0; i < minfo.p2m_size; i++ ) > + { > + if ( (minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 ) > + continue; > + > + pin[nr_pins].cmd = MMUEXT_UNPIN_TABLE; > + pin[nr_pins].arg1.mfn = minfo.p2m_table[i]; > + nr_pins++; > + > + if ( nr_pins == MAX_PIN_BATCH ) > + { > + if ( xc_mmuext_op(xch, pin, nr_pins, domid) < 0 ) > + { > + PERROR("Failed to unpin a batch of %d MFNs", nr_pins); > + goto out; > + } > + else > + DBGPRINTF("Unpinned a batch of %d MFNs", nr_pins); > + nr_pins = 0; > + } > + } > + if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, domid) < 0) ) > + { > + PERROR("Failed to unpin a batch of %d MFNs", nr_pins); > + goto out; > + } > + else > + DBGPRINTF("Unpinned a batch of %d MFNs", nr_pins); > + > + /* > + * After unpinning, we also need to remove the _PAGE_PRESENT bit from > + * the domain's PTEs for the pages that we want to deallocate, or they > + * simply could never go away. > + */ > + for (i = 0; i < minfo.p2m_size; i++) > + { > + void *content; > + xen_pfn_t table_type, table_mfn = pfn_to_mfn(i, minfo.p2m_table, > + minfo.guest_width); > + > + if ( table_mfn == INVALID_P2M_ENTRY || > + minfo.pfn_type[i] == XEN_DOMCTL_PFINFO_XTAB ) > + { > + DBGPRINTF("Broken P2M entry at PFN 0x%x", i); > + continue; > + } > + > + table_type = minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK; > + if ( table_type < XEN_DOMCTL_PFINFO_L1TAB || > + table_type > XEN_DOMCTL_PFINFO_L4TAB ) > + continue; > + > + content = xc_map_foreign_range(xch, domid, PAGE_SIZE, > + PROT_READ, table_mfn); > + if ( !content ) > + { > + PERROR("Failed to map the table at MFN 0x%lx", table_mfn); > + goto out; > + } > + > + /* Go through each PTE of each table and clear the _PAGE_PRESENT bit */ > + for ( j = 0; j < pte_num; j++ ) > + { > + uint64_t pte = ((uint64_t *)content)[j]; > + > + if ( !pte || is_xen_mapping(&minfo, table_type, hvirt_start, m2p_mfn0, content, j) ) > + continue; > + > + if ( debug ) > + DBGPRINTF("Entry %d: PTE=0x%lx, 
MFN=0x%lx, PFN=0x%lx", j, pte, > + (uint64_t)((pte & MADDR_MASK_X86)>>PAGE_SHIFT), > + m2p_table[(unsigned long)((pte & MADDR_MASK_X86) > + >>PAGE_SHIFT)]); > + > + pfn = m2p_table[(pte & MADDR_MASK_X86)>>PAGE_SHIFT]; > + pte &= ~_PAGE_PRESENT; > + > + if ( xc_add_mmu_update(xch, mmu, table_mfn << PAGE_SHIFT | > + (j * (sizeof(uint64_t))) | > + MMU_PT_UPDATE_PRESERVE_AD, pte) ) > + PERROR("Failed to add some PTE update operation"); > + else > + cleared_pte++; > + } > + > + if (content) > + munmap(content, PAGE_SIZE); > + } > + if ( cleared_pte && xc_flush_mmu_updates(xch, mmu) ) > + { > + PERROR("Failed flushing some PTE update operations"); > + goto out; > + } > + else > + DBGPRINTF("Cleared presence for %d PTEs", cleared_pte); > + > + /* Scan all the P2M ... */ > + while ( n < minfo.p2m_size ) > + { > + /* ... But all operations are done in batches */ > + for ( i = 0; (i < MAX_BATCH_SIZE) && (n < minfo.p2m_size); n++ ) > + { > + xen_pfn_t mfn = pfn_to_mfn(n, minfo.p2m_table, minfo.guest_width); > + xen_pfn_t mfn_type = minfo.pfn_type[n] & XEN_DOMCTL_PFINFO_LTAB_MASK; > + > + if (mfn == INVALID_P2M_ENTRY || !is_mapped(mfn) ) > + { > + if ( debug ) > + DBGPRINTF("Skipping invalid or unmapped MFN 0x%lx", mfn); > + n_skip++; > + continue; > + } > + if ( mfn_type == XEN_DOMCTL_PFINFO_BROKEN || > + mfn_type == XEN_DOMCTL_PFINFO_XTAB || > + mfn_type == XEN_DOMCTL_PFINFO_XALLOC ) > + { > + if ( debug ) > + DBGPRINTF("Skipping broken or alloc-only MFN 0x%lx", mfn); > + n_skip++; > + continue; > + } > + > + /* > + if ( gnttab_v1 ? > + xc_is_page_granted_v1(xch, mfn, gnttab_v1, gnt_num) : > + xc_is_page_granted_v2(xch, mfn, gnttab_v2, gnt_num) ) > + { > + n_skip++; > + continue; > + } > + */ > + > + old_mfns[i] = mfn; > + batch_pfns[i] = n; > + i++; > + } > + > + /* Was the batch empty? */ > + if ( i == 0) > + continue; > + > + /* > + * And now the core of the whole thing: map the PFNs in the batch, > + * back them up, allocate new pages for them, and copy them there. 
> + * We do this in this order, and we pass through a local backup, > + * because we don't want to risk hitting the max_mem limit for > + * the domain (which would be possible, depending on MAX_BATCH_SIZE, > + * if we try to do it like allocate->copy->deallocate). > + * > + * With MAX_BATCH_SIZE of 1024 and 4K pages, this means we are moving > + * 4MB of guest memory for each batch. > + */ > + > + /* Map and backup */ > + old_p = xc_map_foreign_pages(xch, domid, PROT_READ, old_mfns, i); > + if ( !old_p ) > + { > + PERROR("Failed mapping the current MFNs\n"); > + goto out; > + } > + memcpy(backup, old_p, PAGE_SIZE * i); > + munmap(old_p, PAGE_SIZE * i); > + > + /* Deallocation and re-allocation */ > + if ( xc_domain_decrease_reservation(xch, domid, i, 0, old_mfns) != i || > + xc_domain_populate_physmap_exact(xch, domid, i, 0, 0, new_mfns) ) > + { > + PERROR("Failed making space or allocating the new MFNs\n"); > + munmap(backup, PAGE_SIZE * i); > + goto out; > + } > + > + /* Map the new pages, copy the content and unmap */ > + new_p = xc_map_foreign_pages(xch, domid, PROT_WRITE, new_mfns, i); > + if ( !new_p ) > + { > + PERROR("Failed mapping the new MFNs\n"); > + munmap(backup, PAGE_SIZE * i); > + goto out; > + } > + memcpy(new_p, backup, PAGE_SIZE * i); > + munmap(new_p, PAGE_SIZE * i); > + munmap(backup, PAGE_SIZE * i); > + > + /* > + * Since we already have the new MFNs, we can update both the M2P > + * and the P2M right here, within this same loop. 
> + */ > + for ( j = 0; j < i; j++ ) > + { > + minfo.p2m_table[batch_pfns[j]] = new_mfns[j]; > + if ( xc_add_mmu_update(xch, mmu, > + (((uint64_t)new_mfns[j]) << PAGE_SHIFT) | > + MMU_MACHPHYS_UPDATE, batch_pfns[j]) ) > + { > + PERROR("Failed updating M2P\n"); > + goto out; > + } > + } > + if ( xc_flush_mmu_updates(xch, mmu) ) > + { > + PERROR("Failed updating M2P\n"); > + goto out; > + } > + > + DBGPRINTF("Batch %lu/%ld done (%lu pages skipped)", > + n / MAX_BATCH_SIZE, minfo.p2m_size / MAX_BATCH_SIZE, n_skip); > + } > + > + /* > + * Finally (oh, well...) update the PTEs of the domain again, putting > + * the new MFNs there, and making the entries _PAGE_PRESENT again. > + * > + * This is a kind of uncanonicalization, like the one that happens in > + * save-restore, although a very special one, and we rely on the snapshot > + * of the M2P we made before starting all the deallocation/reallocation > + * process. > + */ > + for ( i = 0; i < minfo.p2m_size; i++ ) > + { > + void *content; > + xen_pfn_t table_type, table_mfn = pfn_to_mfn(i, minfo.p2m_table, > + minfo.guest_width); > + > + table_type = minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK; > + if ( table_type < XEN_DOMCTL_PFINFO_L1TAB || > + table_type > XEN_DOMCTL_PFINFO_L4TAB ) > + continue; > + > + /* We of course only care about tables */ > + content = xc_map_foreign_range(xch, domid, PAGE_SIZE, > + PROT_WRITE, table_mfn); > + if ( !content ) > + { > + PERROR("Failed to map the table at MFN 0x%lx", table_mfn); > + continue; > + } > + > + for ( j = 0; j < PAGE_SIZE / 8; j++ ) > + { > + uint64_t pte = ((uint64_t *)content)[j]; > + > + if ( !pte || is_xen_mapping(&minfo, table_type, hvirt_start, m2p_mfn0, content, j) ) > + continue; > + > + /* > + * Basically, we look up the PFN from the snapshotted M2P and we > + * pick up the new MFN from the P2M (since we updated it "live" > + * during the re-allocation phase above). 
> + */ > + mfn = (pte >> PAGE_SHIFT) & MFN_MASK_X86; > + pfn = orig_m2p[mfn]; > + > + if ( debug ) > + DBGPRINTF("Table[PTE]: 0x%lx[%d] ==> orig_m2p[0x%lx]=0x%lx, " > + "p2m[0x%lx]=0x%lx // pte: 0x%lx --> 0x%lx", > + table_mfn, j, mfn, pfn, pfn, minfo.p2m_table[pfn], > + pte, (uint64_t)((pte & ~MADDR_MASK_X86)| > + (minfo.p2m_table[pfn]<<PAGE_SHIFT)| > + _PAGE_PRESENT)); > + > + mfn = minfo.p2m_table[pfn]; > + pte &= ~MADDR_MASK_X86; > + pte |= (uint64_t)mfn << PAGE_SHIFT; > + pte |= _PAGE_PRESENT; > + > + ((uint64_t *)content)[j] = pte; > + > + if ( !MFN_IS_IN_PSEUDOPHYS_MAP(mfn, max_mfn, minfo, m2p_table) ) > + { > + ERROR("Failed updating entry %d in table at MFN 0x%lx", j, table_mfn); > + continue; // XXX > + } > + } > + > + if ( content ) > + munmap(content, PAGE_SIZE); > + } > + > + DBGPRINTF("Re-pinning page table MFNs"); > + > + /* Pin the table types again */ > + nr_pins = 0; > + for ( i = 0; i < minfo.p2m_size; i++ ) > + { > + if ( (minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 ) > + continue; > + > + switch ( minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) > + { > + case XEN_DOMCTL_PFINFO_L1TAB: > + pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE; > + break; > + > + case XEN_DOMCTL_PFINFO_L2TAB: > + pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE; > + break; > + > + case XEN_DOMCTL_PFINFO_L3TAB: > + pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE; > + break; > + > + case XEN_DOMCTL_PFINFO_L4TAB: > + pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE; > + break; > + default: > + continue; > + } > + pin[nr_pins].arg1.mfn = minfo.p2m_table[i]; > + nr_pins++; > + > + if ( nr_pins == MAX_PIN_BATCH ) > + { > + if ( xc_mmuext_op(xch, pin, nr_pins, domid) < 0 ) > + { > + PERROR("Failed to pin a batch of %d MFNs", nr_pins); > + goto out; > + } > + else > + DBGPRINTF("Re-pinned a batch of %d MFNs", nr_pins); > + nr_pins = 0; > + } > + } > + if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, domid) < 0) ) > + { > + PERROR("Failed to pin batch of %d page tables", nr_pins); > + goto out; > 
+ } > + else > + DBGPRINTF("Re-pinned a batch of %d MFNs", nr_pins); > + > + /* > + * Now, take care of the vCPUs' contexts. It all happens as above: > + * we use the original M2P and the new domain's P2M to update all > + * the various references. > + */ > + for ( i = 0; i <= info.max_vcpu_id; i++ ) > + { > + xc_vcpuinfo_t vinfo; > + > + DBGPRINTF("Adjusting context for VCPU%d", i); > + > + if ( xc_vcpu_getinfo(xch, domid, i, &vinfo) ) > + { > + PERROR("Failed getting info for VCPU%d", i); > + goto out; > + } > + if ( !vinfo.online ) > + { > + DBGPRINTF("VCPU%d seems offline", i); > + continue; > + } > + > + if ( xc_vcpu_getcontext(xch, domid, i, &ctxt) ) > + { > + PERROR("No context for VCPU%d", i); > + goto out; > + } > + > + if ( i == 0 ) > + { > + //start_info_any_t *start_info; > + > + /* > + * Update the start info frame number. It is the 3rd argument > + * to the HYPERVISOR_sched_op hypercall when op is > + * SCHEDOP_shutdown and reason is SHUTDOWN_suspend, so we find > + * it in EDX. > + */ > + mfn = GET_FIELD(&ctxt, user_regs.edx); > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + SET_FIELD(&ctxt, user_regs.edx, mfn); > + > + /* > + * XXX: I checked, and store_mfn and console_mfn seemed ok, at > + * least from a 'mapping' point of view, but more testing is > + * needed. 
> + start_info = xc_map_foreign_range(xch, domid, PAGE_SIZE, PROT_READ | PROT_WRITE, mfn); > + munmap(start_info, PAGE_SIZE); > + */ > + } > + > + /* GDT pointing MFNs */ > + for ( j = 0; (512*j) < GET_FIELD(&ctxt, gdt_ents); j++ ) > + { > + mfn = GET_FIELD(&ctxt, gdt_frames[j]); > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + SET_FIELD(&ctxt, gdt_frames[j], mfn); > + } > + > + /* CR3 XXX: PAE needs special attention here, I think */ > + mfn = UNFOLD_CR3(GET_FIELD(&ctxt, ctrlreg[3])); > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + SET_FIELD(&ctxt, ctrlreg[3], FOLD_CR3(mfn)); > + > + /* Guest pagetable (x86/64) in CR1 */ > + if ( (minfo.pt_levels == 4) && ctxt.x64.ctrlreg[1] ) > + { > + /* > + * XXX: save-restore code mangles the least-significant > + * bit ('valid PFN'). This should not be needed here. > + */ > + mfn = UNFOLD_CR3(ctxt.x64.ctrlreg[1]); > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + ctxt.x64.ctrlreg[1] = FOLD_CR3(mfn); > + } > + > + /* > + * XXX: Xen refuses to set a new context for an existing vCPU if > + * things like CR3 or the GDTs have changed, even if the domain > + * is suspended. Going through re-initializing the vCPU (by > + * this one call below with a NULL ctxt) makes it possible, > + * but is that sensible? And even if so, is the _setcontext > + * call issued below enough? > + */ > + if ( xc_vcpu_setcontext(xch, domid, i, NULL) ) > + { > + PERROR("Failed re-initialising VCPU%d", i); > + goto out; > + } > + if ( xc_vcpu_setcontext(xch, domid, i, &ctxt) ) > + { > + PERROR("Failed when updating context for VCPU%d", i); > + goto out; > + } > + } > + > + /* > + * Finally (and this time for real), we take care of the pages mapping > + * the P2M, and of the P2M entries themselves. 
> + */ > + > + live_shinfo = xc_map_foreign_range(xch, domid, > + PAGE_SIZE, PROT_READ|PROT_WRITE, info.shared_info_frame); > + if ( !live_shinfo ) > + { > + PERROR("Failed mapping live_shinfo"); > + goto out; > + } > + > + fll = GET_FIELD(live_shinfo, arch.pfn_to_mfn_frame_list_list); > + fll = minfo.p2m_table[mfn_to_pfn(fll, orig_m2p)]; > + live_p2m_frame_list_list = xc_map_foreign_range(xch, domid, PAGE_SIZE, > + PROT_READ|PROT_WRITE, fll); > + if ( !live_p2m_frame_list_list ) > + { > + PERROR("Couldn't map live_p2m_frame_list_list"); > + goto out; > + } > + SET_FIELD(live_shinfo, arch.pfn_to_mfn_frame_list_list, fll); > + > + /* First, update the frames containing the list of the P2M frames */ > + for ( i = 0; i < P2M_FLL_ENTRIES; i++ ) > + { > + > + mfn = ((uint64_t *)live_p2m_frame_list_list)[i]; > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + ((uint64_t *)live_p2m_frame_list_list)[i] = mfn; > + } > + > + live_p2m_frame_list = > + xc_map_foreign_pages(xch, domid, PROT_READ|PROT_WRITE, > + live_p2m_frame_list_list, > + P2M_FLL_ENTRIES); > + if ( !live_p2m_frame_list ) > + { > + PERROR("Couldn't map live_p2m_frame_list"); > + goto out; > + } > + > + /* And then update the actual entries of it */ > + for ( i = 0; i < P2M_FL_ENTRIES; i++ ) > + { > + mfn = ((uint64_t *)live_p2m_frame_list)[i]; > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + ((uint64_t *)live_p2m_frame_list)[i] = mfn; > + } > + > + rc = 0; > + > + out: > + if ( live_p2m_frame_list_list ) > + munmap(live_p2m_frame_list_list, PAGE_SIZE); > + if ( live_p2m_frame_list ) > + munmap(live_p2m_frame_list, P2M_FLL_ENTRIES * PAGE_SIZE); > + if ( live_shinfo ) > + munmap(live_shinfo, PAGE_SIZE); > + > + free(mmu); > + free(new_mfns); > + free(old_mfns); > + free(batch_pfns); > + free(backup); > + free(orig_m2p); > + > + /* > + if (gnttab_v1) > + munmap(gnttab_v1, gnt_num / (PAGE_SIZE/sizeof(grant_entry_v1_t))); > + if (gnttab_v2) > + munmap(gnttab_v2, gnt_num / 
(PAGE_SIZE/sizeof(grant_entry_v2_t))); > + */ > + > + xc_unmap_domain_meminfo(xch, &minfo); > + munmap(m2p_table, M2P_SIZE(max_mfn)); > + > + return !!rc; > +} > diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h > --- a/tools/libxc/xenguest.h > +++ b/tools/libxc/xenguest.h > @@ -272,6 +272,15 @@ int xc_query_page_offline_status(xc_inte > > int xc_exchange_page(xc_interface *xch, int domid, xen_pfn_t mfn); > > +/** > + * This function deallocates all the guest's memory and immediately > + * allocates it again, with the net effect of moving it somewhere > + * else with respect to where it was when the function was invoked. > + * > + * @param xch a handle to an open hypervisor interface. > + * @param domid the domain id one wants to move the memory of. > + */ > +int xc_domain_move_memory(xc_interface *xch, uint32_t domid/*, int hvm*/); > > /** > * Memory related information, such as PFN types, the P2M table, > diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h > --- a/tools/libxc/xg_private.h > +++ b/tools/libxc/xg_private.h > @@ -145,6 +145,11 @@ static inline xen_pfn_t pfn_to_mfn(xen_p > (((uint32_t *)p2m)[(pfn)])))); > } > > +static inline xen_pfn_t mfn_to_pfn(xen_pfn_t mfn, xen_pfn_t *m2p) > +{ > + return m2p[mfn]; > +} > + > /* Number of xen_pfn_t in a page */ > #define FPP (PAGE_SIZE/(dinfo->guest_width)) > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
George Dunlap
2013-May-02 15:07 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On 02/05/13 15:32, Tim Deegan wrote: > Hi, > > This looks like a promising start. Two thoughts: > > 1. You currently move memory into a buffer, free it, allocate new memory > and restore the contents. Copying directly from old to new would be > significantly faster, and you could do it for _most_ batches: > - copy old batch 0 to the backup buffer; free old batch 0; > - allocate new batch 1; copy batch 1 directly; free old batch 1; > ... > - allocate new batch n; copy batch n directly; free old batch n; > - allocate new batch 0; copy batch 0 from the backup buffer. Hmm -- isn't it the case that if there is not *free* memory lying around somewhere, then this operation is fairly pointless? What will happen is that after freeing batch 0, "allocate new batch 1" will get that memory. So copying it to a temporary buffer in dom0 seems like not a particularly useful thing to do -- it should try to allocate a new buffer to copy into directly, and if that fails, just say "No point trying -- no empty memory to move into." Unless of course we were trying to do this to two (or more) VMs at the same time, but that seems like the next level. -George
Tim Deegan
2013-May-02 15:13 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
At 16:07 +0100 on 02 May (1367510834), George Dunlap wrote: > On 02/05/13 15:32, Tim Deegan wrote: > >Hi, > > > >This looks like a promising start. Two thoughts: > > > >1. You currently move memory into a buffer, free it, allocate new memory > > and restore the contents. Copying directly from old to new would be > > significantly faster, and you could do it for _most_ batches: > > - copy old batch 0 to the backup buffer; free old batch 0; > > - allocate new batch 1; copy batch 1 directly; free old batch 1; > > ... > > - allocate new batch n; copy batch n directly; free old batch n; > > - allocate new batch 0; copy batch 0 from the backup buffer. > > Hmm -- isn't it the case that if there is not *free* memory lying around > somewhere, then this operation is fairly pointless? What will happen is > that after freeing batch 0, "allocate new batch 1" will get that > memory. So copying it to a temporary buffer in dom0 seems like not a > particularly useful thing to do -- it should try to allocate a new > buffer to copy into directly, and if that fails, just say "No point > trying -- no empty memory to move into." Sure, that's better, as long as the temporary bump in the VM's max_pages is acceptable to the rest of the toolstack. :) Tim.
Dario Faggioli
2013-May-06 17:29 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On gio, 2013-05-02 at 15:32 +0100, Tim Deegan wrote: > Hi, > Hi Tim, Thanks for looking at this! :-) > This looks like a promising start. Two thoughts: > > 1. You currently move memory into a buffer, free it, allocate new memory > and restore the contents. Copying directly from old to new would be > significantly faster, and you could do it for _most_ batches: > - copy old batch 0 to the backup buffer; free old batch 0; > - allocate new batch 1; copy batch 1 directly; free old batch 1; > ... > - allocate new batch n; copy batch n directly; free old batch n; > - allocate new batch 0; copy batch 0 from the backup buffer. > I see what you mean, and I think it's feasible. One thing I noticed (and not yet tracked down properly, actually) is some sort of "latency" in freeing the pages... I'll investigate that better and go for what you suggest if possible. > 2. Clearing all the _PAGE_PRESENT bits with mmu-update > hypercalls must be overkill. It ought to be possible to drop > those pages' typecounts to 0 by unpinning them and then resetting all > the vcpus. Then you should be able to just update the contents > with normal writes and re-pin afterwards. > Yeah, I thought the same, but haven't found a sensible way of making that happen yet. However, the 'reset all vcpus' thing definitely needs more attention (and I'm investigating it right now). I'll keep digging and let you know what I find. Thanks again, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2013-May-06 17:37 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On gio, 2013-05-02 at 16:13 +0100, Tim Deegan wrote: > At 16:07 +0100 on 02 May (1367510834), George Dunlap wrote: > > > Hmm -- isn't it the case that if there is not *free* memory lying around > > somewhere, then this operation is fairly pointless? What will happen is > > that after freeing batch 0, "allocate new batch 1" will get that > > memory. So copying it to a temporary buffer in dom0 seems like not a > > particularly useful thing to do -- it should try to allocate a new > > buffer to copy into directly, and if that fails, just say "No point > > trying -- no empty memory to move into." > George, good point, checking for free memory is something I did not think about, but it's necessary for this whole thing to be meaningful. This could be tricky to do in the right way, due to the well-known races we have when dealing with memory at the toolstack level, but I'll give it a thought, thanks. :-) However... > Sure, that's better, as long as the temporary bump in the VM's max_pages > is acceptable to the rest of the toolstack. :) > ... This point of Tim's is the main reason I'm going through a temporary buffer in Dom0: I can't be sure that, if allocating more memory for the domain before freeing fails, it is because the host is actually out of memory or because I'm hitting max_pages. That's why I went for the "deallocate first" approach. I can investigate what temporarily bumping the page limit could mean, but I think I like what Tim proposed in his first e-mail better... Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel