Hi everyone,

It's been a while since I started working on a mechanism for moving all the memory of a domain from one NUMA node to another. Yes, that is part of the more general work on improving Xen NUMA support that I'm carrying on; for more details, see here: http://wiki.xen.org/wiki/Xen_NUMA_Roadmap.

The approach I decided to take is to mimic a sort of back-to-back save/restore. In some more detail: I suspend the domain, deallocate all its memory, reallocate it in different places (for instance, if the NUMA node-affinity changed in the meanwhile), update the domain's and Xen's address translation tables, and resume the domain. Easy, eh? :-D

All the above happens at the libxc level, although the patch series provides all the glue code needed to interact with the new feature from both libxl and xl.

Also, note that this first series focuses on PV guests. For HVM, some "if (hvm)" here and there will do the trick, together, of course, with the proper updating of HAP tables, etc. Still much less work than all the tweaking required by PV guests! I'll include more HVM bits in future releases of this series but, in case you have comments on that aspect too, do feel free to provide them right now.

I got sidetracked and distracted many times and, I have to admit, this is not quite a done job yet. However, I have reached the point where at least part of what I have can be shown, so that you can provide some early feedback and help me proceed further with future design choices and implementation steps. I find it quite challenging as, especially for PV, it touches and exercises a lot of code paths and features I'm not yet very familiar with. That is why feedback is really important, even if the thing is still at an early stage.
For instance, discussing how to properly deal with things like grant tables or TMEM, or how to make sure we do not mess up vCPU contexts, would be really great. Despite the RFC status, I did my best to facilitate that, both when writing the code and the comments/changelogs for each patch. For instance, I've put some 'XXX'-marked spots where I thought something was missing and/or commenting is most needed. If you find 5 minutes to look into them, that would be much appreciated. :-)

I know we're in a very particular moment, due to the 4.3 freeze, so I understand if people are busy finalizing the existing and proposed features instead of reviewing RFCs for new ones, but I still felt it would be worthwhile to send this out. Let's see if anyone takes this chance to tell me how bad it looks! ;-P

About the series, the patches on which to concentrate, especially at this stage, are:

 6/8 libxc: introduce xc_domain_move_memory
 7/8 libxl: introduce libxl_domain_move_memory

The others introduce minor changes, ancillary to the two above.

Thanks in advance and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 1 of 8 [RFC]] xl: allow for node-wise specification of vcpu pinning
Making it possible to use something like the following:
 * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
 * "nodes:0-3,node:^2": all pCPUs of nodes 0,1,3;
 * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6;
 * ...
in both the domain config file and `xl vcpu-pin'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -125,6 +125,26 @@ run on cpu #3 of the host.
 
 =back
 
+A C<CPU-LIST> may also be specified NUMA node-wise as follows:
+
+=over 4
+
+=item "nodes:all"
+
+To allow all the vcpus of the guest to run on all the cpus of all the NUMA
+nodes of the host.
+
+=item "nodes:0-3,node:^2"
+
+To allow all the vcpus of the guest to run on the cpus belonging to
+the NUMA nodes 0,1,3 of the host.
+
+=back
+
+Combining the two is allowed. For instance, "1,node:2,^6" means all the
+vcpus of the guest will run on cpu 1 and on all the cpus of NUMA node 2,
+but not on cpu 6.
+
 If this option is not specified, libxl automatically tries to place the new
 domain on the host's NUMA nodes (provided the host has more than one NUMA
 node) by pinning it to the cpus of those nodes.
A heuristic approach is diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -504,61 +504,99 @@ static void split_string_into_string_lis free(s); } +static int range_parse_bitmap(const char *str, libxl_bitmap *map) +{ + char *nstr, *endptr; + uint32_t ida, idb; + + ida = idb = strtoul(str, &endptr, 10); + if (endptr == str) + return EINVAL; + + if (*endptr == ''-'') { + nstr = endptr + 1; + idb = strtoul(nstr, &endptr, 10); + if (endptr == nstr) + return EINVAL; + } + + libxl_bitmap_set_none(map); + while (ida <= idb) { + libxl_bitmap_set(map, ida); + ida++; + } + + return 0; +} + static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) { - libxl_bitmap exclude_cpumap; - uint32_t cpuida, cpuidb; - char *endptr, *toka, *tokb, *saveptr = NULL; - int i, rc = 0, rmcpu; - - if (!strcmp(cpu, "all")) { + libxl_bitmap map, cpu_nodemap, *this_map; + char *ptr, *saveptr = NULL; + bool isnot, isnode; + int i, rc = 0; + + if (!strcmp(cpu, "all") || !strcmp(cpu, "nodes:all")) { libxl_bitmap_set_any(cpumap); return 0; } - if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) { - fprintf(stderr, "Error: Failed to allocate cpumap.\n"); - return ENOMEM; - } - - for (toka = strtok_r(cpu, ",", &saveptr); toka; - toka = strtok_r(NULL, ",", &saveptr)) { - rmcpu = 0; - if (*toka == ''^'') { - /* This (These) Cpu(s) will be removed from the map */ - toka++; - rmcpu = 1; - } - /* Extract a valid (range of) cpu(s) */ - cpuida = cpuidb = strtoul(toka, &endptr, 10); - if (endptr == toka) { + libxl_bitmap_init(&map); + libxl_bitmap_init(&cpu_nodemap); + + rc = libxl_node_bitmap_alloc(ctx, &cpu_nodemap, 0); + if (rc) { + fprintf(stderr, "libxl_node_bitmap_alloc failed.\n"); + goto out; + } + rc = libxl_cpu_bitmap_alloc(ctx, &map, 0); + if (rc) { + fprintf(stderr, "libxl_cpu_bitmap_alloc failed.\n"); + goto out; + } + + for (ptr = strtok_r(cpu, ",", &saveptr); ptr; + ptr = strtok_r(NULL, ",", &saveptr)) { + isnot = 
isnode = false; + + /* Are we dealing with cpus or nodes? */ + if (!strncmp(ptr, "node:", 5) || !strncmp(ptr, "nodes:", 6)) { + isnode = true; + ptr += 5 + (ptr[4] == ''s''); + } + /* Are we adding or removing cpus/nodes? */ + if (*ptr == ''^'') { + isnot = true; + ptr++; + } + /* Get in map a bitmap representative of the range */ + if (range_parse_bitmap(ptr, &map)) { fprintf(stderr, "Error: Invalid argument.\n"); rc = EINVAL; - goto vcpp_out; - } - if (*endptr == ''-'') { - tokb = endptr + 1; - cpuidb = strtoul(tokb, &endptr, 10); - if (endptr == tokb || cpuida > cpuidb) { - fprintf(stderr, "Error: Invalid argument.\n"); - rc = EINVAL; - goto vcpp_out; + goto out; + } + + /* Add or remove the specified cpus */ + if (isnode) { + rc = libxl_nodemap_to_cpumap(ctx, &map, &cpu_nodemap); + if (rc) { + fprintf(stderr, "libxl_nodemap_to_cpumap failed.\n"); + goto out; } - } - while (cpuida <= cpuidb) { - rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) : - libxl_bitmap_set(&exclude_cpumap, cpuida); - cpuida++; - } - } - - /* Clear all the cpus from the removal list */ - libxl_for_each_set_bit(i, exclude_cpumap) { - libxl_bitmap_reset(cpumap, i); - } - -vcpp_out: - libxl_bitmap_dispose(&exclude_cpumap); + this_map = &cpu_nodemap; + } else { + this_map = ↦ + } + + libxl_for_each_set_bit(i, *this_map) { + isnot ? libxl_bitmap_reset(cpumap, i) + : libxl_bitmap_set(cpumap, i); + } + } + + out: + libxl_bitmap_dispose(&map); + libxl_bitmap_dispose(&cpu_nodemap); return rc; }
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 2 of 8 [RFC]] xl: allow for changing NUMA node affinity on-line
by implementing the "node-affinity" command, acting pretty much like "vcpu-pin", although it of course affects node affinity rather than vcpu affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -626,6 +626,29 @@ different run state is appropriate. Pin
 this, by ensuring certain VCPUs can only run on certain physical
 CPUs.
 
+=item B<node-affinity> I<domain-id> I<nodes>
+
+Sets or changes the NUMA node affinity for the domain. All future
+memory allocations for the domain will use memory belonging to I<nodes>.
+Also (if the credit scheduler is in use), the VCPUs of the domain will
+run on the CPUs belonging to I<nodes> as much as possible.
+
+This is different from VCPU pinning, as VCPUs are not prevented from
+running on CPUs outside I<nodes>; that can happen, for instance, in
+order to avoid having VCPUs waiting in some PCPU's runqueue while
+other PCPUs are idle.
+
+Changing a domain's node affinity does not affect any memory that
+was already allocated before the command is invoked.
+
+The keyword B<all> can be used to make the domain affine to all the
+NUMA nodes in the host. The keyword B<none> can be used to reset the
+node affinity. In that case, and from that point on, the node affinity
+of the domain will be calculated automatically, based on its vcpu
+affinity (see B<vcpu-pin> above). More specifically, the node affinity
+will consist of the nodes to which the physical CPUs in the domain's
+vcpu affinity belong.
+
 =item B<vm-list>
 
 Prints information about guests.
This list excludes information about diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h --- a/tools/libxl/xl.h +++ b/tools/libxl/xl.h @@ -58,6 +58,7 @@ int main_vm_list(int argc, char **argv); int main_create(int argc, char **argv); int main_config_update(int argc, char **argv); int main_button_press(int argc, char **argv); +int main_nodeaffinity(int argc, char **argv); int main_vcpupin(int argc, char **argv); int main_vcpuset(int argc, char **argv); int main_memmax(int argc, char **argv); diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -601,6 +601,54 @@ static int vcpupin_parse(char *cpu, libx return rc; } +static int nodeaffinity_parse(char *node, libxl_bitmap *nodemap) +{ + char *ptr, *saveptr = NULL; + int i, rc = 0, isnot; + libxl_bitmap map; + + if (!strcmp(node, "all")) { + libxl_bitmap_set_any(nodemap); + return 0; + } else if (!strcmp(node, "none")) { + libxl_bitmap_set_none(nodemap); + return 0; + } + + rc = libxl_node_bitmap_alloc(ctx, &map, 0); + if (rc) { + fprintf(stderr, "Error: Failed to allocate nodemap.\n"); + goto out; + } + + for (ptr = strtok_r(node, ",", &saveptr); ptr; + ptr = strtok_r(NULL, ",", &saveptr)) { + isnot = false; + + /* Adding or removing nodes? */ + if (*ptr == ''^'') { + isnot = true; + ptr++; + } + /* Get in map a bitmap representative of the range */ + if (range_parse_bitmap(ptr, &map)) { + fprintf(stderr, "Error: Invalid argument.\n"); + rc = EINVAL; + goto out; + } + + libxl_for_each_set_bit(i, map) { + isnot ? 
libxl_bitmap_reset(nodemap, i) + : libxl_bitmap_set(nodemap, i); + } + } + + out: + libxl_bitmap_dispose(&map); + + return rc; +} + static void parse_config_data(const char *config_source, const char *config_data, int config_len, @@ -4583,6 +4631,39 @@ int main_vcpuset(int argc, char **argv) return 0; } +static void nodeaffinity(uint32_t domid, char *node) +{ + libxl_bitmap nodemap; + + if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) { + fprintf(stderr, "libxl_node_bitmap_alloc failed.\n"); + goto out; + } + + if (nodeaffinity_parse(node, &nodemap)) { + fprintf(stderr, "Could not parse node affinity.\n"); + goto out; + } + + if (libxl_domain_set_nodeaffinity(ctx, domid, &nodemap) == -1) + fprintf(stderr, "Could not set node affinity for dom `%d''.\n", domid); + + out: + libxl_bitmap_dispose(&nodemap); +} + +int main_nodeaffinity(int argc, char **argv) +{ + int opt; + + SWITCH_FOREACH_OPT(opt, "", NULL, "node-affinity", 2) { + /* No options */ + } + + nodeaffinity(find_domain(argv[optind]), argv[optind+1]); + return 0; +} + static void output_xeninfo(void) { const libxl_version_info *info; diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c --- a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -214,6 +214,11 @@ struct cmd_spec cmd_table[] = { "Set which CPUs a VCPU can use", "<Domain> <VCPU|all> <CPUs|all>", }, + { "node-affinity", + &main_nodeaffinity, 0, 1, + "Set the NUMA node affinity for the domain", + "<Domain> [<NODEs|all|none>]", + }, { "vcpu-set", &main_vcpuset, 0, 1, "Set the number of active VCPUs allowed for the domain",
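Assuming the command ends up behaving as documented above, usage would look something like this (the domain name is illustrative):

```
# Restrict future allocations (and, with credit, preferred CPUs) to nodes 0-1
xl node-affinity mydomain 0-1

# Make the domain affine to all nodes again
xl node-affinity mydomain all

# Go back to deriving node affinity automatically from vcpu affinity
xl node-affinity mydomain none
```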
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 3 of 8 [RFC]] libxc: introduce xc_domain_get_address_size
As a wrapper to XEN_DOMCTL_get_address_size, and use it wherever the call was being issued directly via do_domctl(), saving quite some line of code. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxc/xc_core.c b/tools/libxc/xc_core.c --- a/tools/libxc/xc_core.c +++ b/tools/libxc/xc_core.c @@ -417,24 +417,6 @@ elfnote_dump_format_version(xc_interface return dump_rtn(xch, args, (char*)&format_version, sizeof(format_version)); } -static int -get_guest_width(xc_interface *xch, - uint32_t domid, - unsigned int *guest_width) -{ - DECLARE_DOMCTL; - - memset(&domctl, 0, sizeof(domctl)); - domctl.domain = domid; - domctl.cmd = XEN_DOMCTL_get_address_size; - - if ( do_domctl(xch, &domctl) != 0 ) - return 1; - - *guest_width = domctl.u.address_size.size / 8; - return 0; -} - int xc_domain_dumpcore_via_callback(xc_interface *xch, uint32_t domid, @@ -478,11 +460,12 @@ xc_domain_dumpcore_via_callback(xc_inter struct xc_core_section_headers *sheaders = NULL; Elf64_Shdr *shdr; - if ( get_guest_width(xch, domid, &dinfo->guest_width) != 0 ) + if ( xc_domain_get_address_size(xch, domid, &dinfo->guest_width) != 0 ) { PERROR("Could not get address size for domain"); return sts; } + dinfo->guest_width /= 8; xc_core_arch_context_init(&arch_ctxt); if ( (dump_mem_start = malloc(DUMP_INCREMENT*PAGE_SIZE)) == NULL ) diff --git a/tools/libxc/xc_cpuid_x86.c b/tools/libxc/xc_cpuid_x86.c --- a/tools/libxc/xc_cpuid_x86.c +++ b/tools/libxc/xc_cpuid_x86.c @@ -436,17 +436,15 @@ static void xc_cpuid_pv_policy( const unsigned int *input, unsigned int *regs) { DECLARE_DOMCTL; + unsigned int guest_width; int guest_64bit, xen_64bit = hypervisor_is_64bit(xch); char brand[13]; uint64_t xfeature_mask; xc_cpuid_brand_get(brand); - memset(&domctl, 0, sizeof(domctl)); - domctl.domain = domid; - domctl.cmd = XEN_DOMCTL_get_address_size; - do_domctl(xch, &domctl); - guest_64bit = (domctl.u.address_size.size == 64); + xc_domain_get_address_size(xch, domid, &guest_width); + 
guest_64bit = (guest_width == 64); /* Detecting Xen''s atitude towards XSAVE */ memset(&domctl, 0, sizeof(domctl)); diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -270,6 +270,21 @@ out: return ret; } +int xc_domain_get_address_size(xc_interface *xch, uint32_t domid, + unsigned int *addr_size) +{ + DECLARE_DOMCTL; + + memset(&domctl, 0, sizeof(domctl)); + domctl.domain = domid; + domctl.cmd = XEN_DOMCTL_get_address_size; + + if ( do_domctl(xch, &domctl) != 0 ) + return 1; + + *addr_size = domctl.u.address_size.size; + return 0; +} int xc_domain_getinfo(xc_interface *xch, uint32_t first_domid, diff --git a/tools/libxc/xc_offline_page.c b/tools/libxc/xc_offline_page.c --- a/tools/libxc/xc_offline_page.c +++ b/tools/libxc/xc_offline_page.c @@ -193,20 +193,15 @@ static int get_pt_level(xc_interface *xc unsigned int *pt_level, unsigned int *gwidth) { - DECLARE_DOMCTL; xen_capabilities_info_t xen_caps = ""; if (xc_version(xch, XENVER_capabilities, &xen_caps) != 0) return -1; - memset(&domctl, 0, sizeof(domctl)); - domctl.domain = domid; - domctl.cmd = XEN_DOMCTL_get_address_size; - - if ( do_domctl(xch, &domctl) != 0 ) + if (xc_domain_get_address_size(xch, domid, gwidth) != 0) return -1; - *gwidth = domctl.u.address_size.size / 8; + *gwidth /= 8; if (strstr(xen_caps, "xen-3.0-x86_64")) /* Depends on whether it''s a compat 32-on-64 guest */ diff --git a/tools/libxc/xc_pagetab.c b/tools/libxc/xc_pagetab.c --- a/tools/libxc/xc_pagetab.c +++ b/tools/libxc/xc_pagetab.c @@ -51,15 +51,13 @@ unsigned long xc_translate_foreign_addre pt_levels = (ctx.msr_efer&EFER_LMA) ? 4 : (ctx.cr4&CR4_PAE) ? 3 : 2; paddr = ctx.cr3 & ((pt_levels == 3) ? 
~0x1full : ~0xfffull); } else { - DECLARE_DOMCTL; + unsigned int gwidth; vcpu_guest_context_any_t ctx; if (xc_vcpu_getcontext(xch, dom, vcpu, &ctx) != 0) return 0; - domctl.domain = dom; - domctl.cmd = XEN_DOMCTL_get_address_size; - if ( do_domctl(xch, &domctl) != 0 ) + if (xc_domain_get_address_size(xch, dom, &gwidth) != 0) return 0; - if (domctl.u.address_size.size == 64) { + if (gwidth == 64) { pt_levels = 4; paddr = (uint64_t)xen_cr3_to_pfn_x86_64(ctx.x64.ctrlreg[3]) << PAGE_SHIFT; diff --git a/tools/libxc/xc_resume.c b/tools/libxc/xc_resume.c --- a/tools/libxc/xc_resume.c +++ b/tools/libxc/xc_resume.c @@ -24,19 +24,6 @@ #include <xen/foreign/x86_64.h> #include <xen/hvm/params.h> -static int pv_guest_width(xc_interface *xch, uint32_t domid) -{ - DECLARE_DOMCTL; - domctl.domain = domid; - domctl.cmd = XEN_DOMCTL_get_address_size; - if ( xc_domctl(xch, &domctl) != 0 ) - { - PERROR("Could not get guest address size"); - return -1; - } - return domctl.u.address_size.size / 8; -} - static int modify_returncode(xc_interface *xch, uint32_t domid) { vcpu_guest_context_any_t ctxt; @@ -71,9 +58,9 @@ static int modify_returncode(xc_interfac else { /* Probe PV guest address width. 
*/ - dinfo->guest_width = pv_guest_width(xch, domid); - if ( dinfo->guest_width < 0 ) + if ( xc_domain_get_address_size(xch, domid, &dinfo->guest_width) ) return -1; + dinfo->guest_width /= 8; } if ( (rc = xc_vcpu_getcontext(xch, domid, 0, &ctxt)) != 0 ) @@ -120,7 +107,8 @@ static int xc_domain_resume_any(xc_inter xc_dominfo_t info; int i, rc = -1; #if defined(__i386__) || defined(__x86_64__) - struct domain_info_context _dinfo = { .p2m_size = 0 }; + struct domain_info_context _dinfo = { .guest_width = 0, + .p2m_size = 0 }; struct domain_info_context *dinfo = &_dinfo; unsigned long mfn; vcpu_guest_context_any_t ctxt; @@ -147,7 +135,8 @@ static int xc_domain_resume_any(xc_inter return rc; } - dinfo->guest_width = pv_guest_width(xch, domid); + xc_domain_get_address_size(xch, domid, &dinfo->guest_width); + dinfo->guest_width /= 8; if ( dinfo->guest_width != sizeof(long) ) { ERROR("Cannot resume uncooperative cross-address-size guests"); diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -556,6 +556,18 @@ int xc_vcpu_getaffinity(xc_interface *xc int vcpu, xc_cpumap_t cpumap); + +/** + * This function will return the address size for the specified domain. + * + * @param xch a handle to an open hypervisor interface. + * @param domid the domain id one wants the address size width of. + * @param addr_size the address size. + */ +int xc_domain_get_address_size(xc_interface *xch, uint32_t domid, + unsigned int *addr_size); + + /** * This function will return information about one or more domains. It is * designed to iterate over the list of domains. 
If a single domain is diff --git a/tools/libxc/xg_save_restore.h b/tools/libxc/xg_save_restore.h --- a/tools/libxc/xg_save_restore.h +++ b/tools/libxc/xg_save_restore.h @@ -301,7 +301,6 @@ static inline int get_platform_info(xc_i { xen_capabilities_info_t xen_caps = ""; xen_platform_parameters_t xen_params; - DECLARE_DOMCTL; if (xc_version(xch, XENVER_platform_parameters, &xen_params) != 0) return 0; @@ -313,14 +312,10 @@ static inline int get_platform_info(xc_i *hvirt_start = xen_params.virt_start; - memset(&domctl, 0, sizeof(domctl)); - domctl.domain = dom; - domctl.cmd = XEN_DOMCTL_get_address_size; - - if ( do_domctl(xch, &domctl) != 0 ) + if ( xc_domain_get_address_size(xch, dom, guest_width) != 0) return 0; - *guest_width = domctl.u.address_size.size / 8; + *guest_width /= 8; /* 64-bit tools will see the 64-bit hvirt_start, but 32-bit guests * will be using the compat one. */ diff --git a/tools/xentrace/xenctx.c b/tools/xentrace/xenctx.c --- a/tools/xentrace/xenctx.c +++ b/tools/xentrace/xenctx.c @@ -892,12 +892,9 @@ static void dump_ctx(int vcpu) } ctxt_word_size = (strstr(xen_caps, "xen-3.0-x86_64")) ? 8 : 4; } else { - struct xen_domctl domctl; - memset(&domctl, 0, sizeof domctl); - domctl.domain = xenctx.domid; - domctl.cmd = XEN_DOMCTL_get_address_size; - if (xc_domctl(xenctx.xc_handle, &domctl) == 0) - ctxt_word_size = guest_word_size = domctl.u.address_size.size / 8; + unsigned int gw; + if ( !xc_domain_get_address_size(xenctx.xc_handle, xenctx.domid, &gw) ) + ctxt_word_size = guest_word_size = gw / 8; } } #endif
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 4 of 8 [RFC]] libxc: introduce xc_map_domain_meminfo (and xc_unmap_domain_meminfo)
And use it in xc_exchange_page(). This is basically because the following changes need something really similar to the set of steps that are here abstracted in these two functions. This is basically pure code motion and, despite of the change in the interface and in the signature of many functions, no functional change is involved. XXX: There is probably more room for using this in other places too, most likely, the save/restore code. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -21,6 +21,8 @@ */ #include "xc_private.h" +#include "xc_core.h" +#include "xg_private.h" #include "xg_save_restore.h" #include <xen/memory.h> #include <xen/hvm/hvm_op.h> @@ -1460,6 +1462,132 @@ int xc_domain_bind_pt_isa_irq( PT_IRQ_TYPE_ISA, 0, 0, 0, machine_irq)); } +int xc_unmap_domain_meminfo(xc_interface *xch, struct xc_domain_meminfo *minfo) +{ + struct domain_info_context _di = { .guest_width = minfo->guest_width }; + struct domain_info_context *dinfo = &_di; + + free(minfo->pfn_type); + if ( minfo->p2m_table ) + munmap(minfo->p2m_table, P2M_FLL_ENTRIES * PAGE_SIZE); + minfo->p2m_table = NULL; + + return 0; +} + +int xc_map_domain_meminfo(xc_interface *xch, int domid, + struct xc_domain_meminfo *minfo) +{ + struct domain_info_context _di; + struct domain_info_context *dinfo = &_di; + + xc_dominfo_t info; + shared_info_any_t *live_shinfo; + xen_capabilities_info_t xen_caps = ""; + int i; + + /* Only be initialized once */ + if ( minfo->pfn_type || minfo->p2m_table ) + { + errno = EINVAL; + return -1; + } + + if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ) + { + PERROR("Could not get domain info"); + return -1; + } + + if ( xc_domain_get_address_size(xch, domid, &minfo->guest_width) ) + { + PERROR("Could not get domain address size"); + return -1; + } + minfo->guest_width /= 8; + _di.guest_width = minfo->guest_width; + + /* Get page table levels 
(see get_platform_info() in xg_save_restore.h */ + if ( xc_version(xch, XENVER_capabilities, &xen_caps) ) + { + PERROR("Could not get Xen capabilities (for page table levels)"); + return -1; + } + if ( strstr(xen_caps, "xen-3.0-x86_64") ) + /* Depends on whether it''s a compat 32-on-64 guest */ + minfo->pt_levels = ( (minfo->guest_width == 8) ? 4 : 3 ); + else if ( strstr(xen_caps, "xen-3.0-x86_32p") ) + minfo->pt_levels = 3; + else if ( strstr(xen_caps, "xen-3.0-x86_32") ) + minfo->pt_levels = 2; + else + { + errno = EFAULT; + return -1; + } + + /* We need the shared info page for mapping the P2M */ + live_shinfo = xc_map_foreign_range(xch, domid, PAGE_SIZE, PROT_READ, + info.shared_info_frame); + if ( !live_shinfo ) + { + PERROR("Could not map the shared info frame (MFN 0x%lx)", + info.shared_info_frame); + return -1; + } + + if ( xc_core_arch_map_p2m_writable(xch, minfo->guest_width, &info, + live_shinfo, &minfo->p2m_table, + &minfo->p2m_size) ) + { + PERROR("Could not map the P2M table"); + munmap(live_shinfo, PAGE_SIZE); + return -1; + } + munmap(live_shinfo, PAGE_SIZE); + _di.p2m_size = minfo->p2m_size; + + /* Make space and prepare for getting the PFN types */ + minfo->pfn_type = calloc(sizeof(*minfo->pfn_type), minfo->p2m_size); + if ( !minfo->pfn_type ) + { + PERROR("Could not allocate memory for the PFN types"); + goto failed; + } + for ( i = 0; i < minfo->p2m_size; i++ ) + minfo->pfn_type[i] = pfn_to_mfn(i, minfo->p2m_table, + minfo->guest_width); + + /* Retrieve PFN types in batches */ + for ( i = 0; i < minfo->p2m_size ; i+=1024 ) + { + int count = ((minfo->p2m_size - i ) > 1024 ) ? 
+ 1024: (minfo->p2m_size - i); + + if ( xc_get_pfn_type_batch(xch, domid, count, minfo->pfn_type + i) ) + { + PERROR("Could not get %d-eth batch of PFN types", (i+1)/1024); + goto failed; + } + } + + return 0; + +failed: + if ( minfo->pfn_type ) + { + free(minfo->pfn_type); + minfo->pfn_type = NULL; + } + if ( minfo->p2m_table ) + { + munmap(minfo->p2m_table, P2M_FLL_ENTRIES * PAGE_SIZE); + minfo->p2m_table = NULL; + } + + return -1; +} + int xc_domain_memory_mapping( xc_interface *xch, uint32_t domid, diff --git a/tools/libxc/xc_offline_page.c b/tools/libxc/xc_offline_page.c --- a/tools/libxc/xc_offline_page.c +++ b/tools/libxc/xc_offline_page.c @@ -33,17 +33,6 @@ #include "xg_private.h" #include "xg_save_restore.h" -struct domain_mem_info{ - int domid; - unsigned int pt_level; - unsigned int guest_width; - xen_pfn_t *pfn_type; - xen_pfn_t *p2m_table; - unsigned long p2m_size; - xen_pfn_t *m2p_table; - int max_mfn; -}; - struct pte_backup_entry { xen_pfn_t table_mfn; @@ -180,141 +169,6 @@ static int xc_is_page_granted_v2(xc_inte return (i != gnt_num); } -static xen_pfn_t pfn_to_mfn(xen_pfn_t pfn, xen_pfn_t *p2m, int gwidth) -{ - return ((xen_pfn_t) ((gwidth==8)? - (((uint64_t *)p2m)[(pfn)]): - ((((uint32_t *)p2m)[(pfn)]) == 0xffffffffU ? - (-1UL) : - (((uint32_t *)p2m)[(pfn)])))); -} - -static int get_pt_level(xc_interface *xch, uint32_t domid, - unsigned int *pt_level, - unsigned int *gwidth) -{ - xen_capabilities_info_t xen_caps = ""; - - if (xc_version(xch, XENVER_capabilities, &xen_caps) != 0) - return -1; - - if (xc_domain_get_address_size(xch, domid, gwidth) != 0) - return -1; - - *gwidth /= 8; - - if (strstr(xen_caps, "xen-3.0-x86_64")) - /* Depends on whether it''s a compat 32-on-64 guest */ - *pt_level = ( (*gwidth == 8) ? 
4 : 3 ); - else if (strstr(xen_caps, "xen-3.0-x86_32p")) - *pt_level = 3; - else if (strstr(xen_caps, "xen-3.0-x86_32")) - *pt_level = 2; - else - return -1; - - return 0; -} - -static int close_mem_info(xc_interface *xch, struct domain_mem_info *minfo) -{ - if (minfo->pfn_type) - free(minfo->pfn_type); - munmap(minfo->m2p_table, M2P_SIZE(minfo->max_mfn)); - munmap(minfo->p2m_table, P2M_FLL_ENTRIES * PAGE_SIZE); - minfo->p2m_table = minfo->m2p_table = NULL; - - return 0; -} - -static int init_mem_info(xc_interface *xch, int domid, - struct domain_mem_info *minfo, - xc_dominfo_t *info) -{ - uint64_aligned_t shared_info_frame; - shared_info_any_t *live_shinfo = NULL; - int i, rc; - - /* Only be initialized once */ - if (minfo->pfn_type || minfo->m2p_table || minfo->p2m_table) - return -EINVAL; - - if ( get_pt_level(xch, domid, &minfo->pt_level, - &minfo->guest_width) ) - { - ERROR("Unable to get PT level info."); - return -EFAULT; - } - dinfo->guest_width = minfo->guest_width; - - shared_info_frame = info->shared_info_frame; - - live_shinfo = xc_map_foreign_range(xch, domid, - PAGE_SIZE, PROT_READ, shared_info_frame); - if ( !live_shinfo ) - { - ERROR("Couldn''t map live_shinfo"); - return -EFAULT; - } - - if ( (rc = xc_core_arch_map_p2m_writable(xch, minfo->guest_width, - info, live_shinfo, &minfo->p2m_table, &minfo->p2m_size)) ) - { - ERROR("Couldn''t map p2m table %x\n", rc); - goto failed; - } - munmap(live_shinfo, PAGE_SIZE); - live_shinfo = NULL; - - dinfo->p2m_size = minfo->p2m_size; - - minfo->max_mfn = xc_maximum_ram_page(xch); - if ( !(minfo->m2p_table - xc_map_m2p(xch, minfo->max_mfn, PROT_READ, NULL)) ) - { - ERROR("Failed to map live M2P table"); - goto failed; - } - - /* Get pfn type */ - minfo->pfn_type = calloc(sizeof(*minfo->pfn_type), minfo->p2m_size); - if (!minfo->pfn_type) - { - ERROR("Failed to malloc pfn_type\n"); - goto failed; - } - - for (i = 0; i < minfo->p2m_size; i++) - minfo->pfn_type[i] = pfn_to_mfn(i, minfo->p2m_table, - 
minfo->guest_width); - - for (i = 0; i < minfo->p2m_size ; i+=1024) - { - int count = ((dinfo->p2m_size - i ) > 1024 ) ? 1024: (dinfo->p2m_size - i); - if ( ( rc = xc_get_pfn_type_batch(xch, domid, count, - minfo->pfn_type + i)) ) - { - ERROR("Failed to get pfn_type %x\n", rc); - goto failed; - } - } - return 0; - -failed: - if (minfo->pfn_type) - { - free(minfo->pfn_type); - minfo->pfn_type = NULL; - } - if (live_shinfo) - munmap(live_shinfo, PAGE_SIZE); - munmap(minfo->m2p_table, M2P_SIZE(minfo->max_mfn)); - munmap(minfo->p2m_table, P2M_FLL_ENTRIES * PAGE_SIZE); - minfo->p2m_table = minfo->m2p_table = NULL; - - return -1; -} - static int backup_ptes(xen_pfn_t table_mfn, int offset, struct pte_backup *backup) { @@ -404,7 +258,7 @@ static int __update_pte(xc_interface *xc } static int change_pte(xc_interface *xch, int domid, - struct domain_mem_info *minfo, + struct xc_domain_meminfo *minfo, struct pte_backup *backup, struct xc_mmu *mmu, pte_func func, @@ -414,7 +268,7 @@ static int change_pte(xc_interface *xch, uint64_t i; void *content = NULL; - pte_num = PAGE_SIZE / ((minfo->pt_level == 2) ? 4 : 8); + pte_num = PAGE_SIZE / ((minfo->pt_levels == 2) ? 4 : 8); for (i = 0; i < minfo->p2m_size; i++) { @@ -437,7 +291,7 @@ static int change_pte(xc_interface *xch, for (j = 0; j < pte_num; j++) { - if ( minfo->pt_level == 2 ) + if ( minfo->pt_levels == 2 ) pte = ((const uint32_t*)content)[j]; else pte = ((const uint64_t*)content)[j]; @@ -449,7 +303,7 @@ static int change_pte(xc_interface *xch, case 1: if ( xc_add_mmu_update(xch, mmu, table_mfn << PAGE_SHIFT | - j * ( (minfo->pt_level == 2) ? + j * ( (minfo->pt_levels == 2) ? 
sizeof(uint32_t): sizeof(uint64_t)) | MMU_PT_UPDATE_PRESERVE_AD, new_pte) ) @@ -482,7 +336,7 @@ failed: } static int update_pte(xc_interface *xch, int domid, - struct domain_mem_info *minfo, + struct xc_domain_meminfo *minfo, struct pte_backup *backup, struct xc_mmu *mmu, unsigned long new_mfn) @@ -492,7 +346,7 @@ static int update_pte(xc_interface *xch, } static int clear_pte(xc_interface *xch, int domid, - struct domain_mem_info *minfo, + struct xc_domain_meminfo *minfo, struct pte_backup *backup, struct xc_mmu *mmu, xen_pfn_t mfn) @@ -540,7 +394,7 @@ static int is_page_exchangable(xc_interf int xc_exchange_page(xc_interface *xch, int domid, xen_pfn_t mfn) { xc_dominfo_t info; - struct domain_mem_info minfo; + struct xc_domain_meminfo minfo; struct xc_mmu *mmu = NULL; struct pte_backup old_ptes = {NULL, 0, 0}; grant_entry_v1_t *gnttab_v1 = NULL; @@ -551,6 +405,8 @@ int xc_exchange_page(xc_interface *xch, int rc, result = -1; uint32_t status; xen_pfn_t new_mfn, gpfn; + xen_pfn_t *m2p_table; + int max_mfn; if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ) { @@ -570,10 +426,26 @@ int xc_exchange_page(xc_interface *xch, return -EINVAL; } - /* Get domain''s memory information */ + /* Map M2P and obtain gpfn */ + max_mfn = xc_maximum_ram_page(xch); + if ( !(m2p_table = xc_map_m2p(xch, max_mfn, PROT_READ, NULL)) ) + { + PERROR("Failed to map live M2P table"); + return -EFAULT; + } + gpfn = m2p_table[mfn]; + + /* Map domain''s memory information */ memset(&minfo, 0, sizeof(minfo)); - init_mem_info(xch, domid, &minfo, &info); - gpfn = minfo.m2p_table[mfn]; + if ( xc_map_domain_meminfo(xch, domid, &minfo) ) + { + PERROR("Could not map domain''s memory information\n"); + return -EFAULT; + } + + /* For translation macros */ + dinfo->guest_width = minfo.guest_width; + dinfo->p2m_size = minfo.p2m_size; /* Don''t exchange CR3 for PAE guest in PAE host environment */ if (minfo.guest_width > sizeof(long)) @@ -763,7 +635,8 @@ failed: if (gnttab_v2) munmap(gnttab_v2, gnt_num / 
(PAGE_SIZE/sizeof(grant_entry_v2_t))); - close_mem_info(xch, &minfo); + xc_unmap_domain_meminfo(xch, &minfo); + munmap(m2p_table, M2P_SIZE(max_mfn)); return result; } diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h --- a/tools/libxc/xenguest.h +++ b/tools/libxc/xenguest.h @@ -274,6 +274,23 @@ int xc_exchange_page(xc_interface *xch, /** + * Memory related information, such as PFN types, the P2M table, + * the guest word width and the guest page table levels. + */ +struct xc_domain_meminfo { + unsigned int pt_levels; + unsigned int guest_width; + xen_pfn_t *pfn_type; + xen_pfn_t *p2m_table; + unsigned long p2m_size; +}; + +int xc_map_domain_meminfo(xc_interface *xch, int domid, + struct xc_domain_meminfo *minfo); + +int xc_unmap_domain_meminfo(xc_interface *xch, struct xc_domain_meminfo *mem); + +/** * This function map m2p table * @parm xch a handle to an open hypervisor interface * @parm max_mfn the max pfn diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h --- a/tools/libxc/xg_private.h +++ b/tools/libxc/xg_private.h @@ -136,6 +136,15 @@ struct domain_info_context { unsigned long p2m_size; }; +static inline xen_pfn_t pfn_to_mfn(xen_pfn_t pfn, xen_pfn_t *p2m, int gwidth) +{ + return ((xen_pfn_t) ((gwidth==8)? + (((uint64_t *)p2m)[(pfn)]): + ((((uint32_t *)p2m)[(pfn)]) == 0xffffffffU ? + (-1UL) : + (((uint32_t *)p2m)[(pfn)])))); +} + /* Number of xen_pfn_t in a page */ #define FPP (PAGE_SIZE/(dinfo->guest_width))
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 5 of 8 [RFC]] libxc: allow for ctxt to be NULL in xc_vcpu_setcontext
since, as can be seen in xen/common/domctl.c, that is not a problem. In fact, passing a NULL context is how vcpu_reset() is exposed outside the hypervisor.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1120,12 +1120,6 @@ int xc_vcpu_setcontext(xc_interface *xch
     DECLARE_HYPERCALL_BOUNCE(ctxt, sizeof(vcpu_guest_context_any_t), XC_HYPERCALL_BUFFER_BOUNCE_IN);
     int rc;
 
-    if (ctxt == NULL)
-    {
-        errno = EINVAL;
-        return -1;
-    }
-
     if ( xc_hypercall_bounce_pre(xch, ctxt) )
         return -1;
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
as a mechanism for deallocating and reallocating (immediately!) _all_ the memory of a domain. Notice it relies on the guest being suspended already, before the function is invoked.

Of course, it is quite likely that the memory ends up in different places from where it was before the call but, for instance, whether that place is actually a different NUMA node (or anything else) does not depend in any way on this function. In fact, here the guest pages are just freed and immediately re-allocated (you can see it as a very quick, back-to-back save-restore cycle). If the current domain configuration says, for instance, that new allocations should go to a specific NUMA node, then the whole domain is, as a matter of fact, moved there, but again, this is not something this function does explicitly.

The way we do this is, very briefly, as follows:
 1. drop all the references to all the pages of a domain,
 2. backup the content of a batch of pages,
 3. deallocate the batch,
 4. allocate a new set of pages for the batch,
 5. copy the backed up content into the new pages,
 6. if there are more pages, go back to 2, otherwise
 7. update the page tables, the vcpu contexts, the P2M, etc.

The above raises a number of quite complex issues and _not_ all of them are dealt with or solved in this series (RFC means something after all, doesn't it? ;-P).

XXX Open issues are:
 - HVM ("easy" to add, but it's not in this patch. See the cover letter for the series);
 - PAE guests, as they need special attention for some of the page tables (should be trivial to add);
 - grant tables/granted pages: how to move them?
 - TMEM: how to "move" it?
 - shared/paged pages: what to do with them?
 - guest pages mapped in Xen, for instance:
   * vcpu info pages: moved but, how to update the mapping?
   * EOI page: moved but, how to update the mapping?
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile --- a/tools/libxc/Makefile +++ b/tools/libxc/Makefile @@ -48,6 +48,11 @@ else GUEST_SRCS-y += xc_nomigrate.c endif +# XXX: Well, for sure there are some X86-ism in the current code. +# Making it more ARM friendly should not be a big deal though, +# will do for next release. +GUEST_SRCS-$(CONFIG_X86) += xc_domain_movemem.c + vpath %.c ../../xen/common/libelf CFLAGS += -I../../xen/common/libelf diff --git a/tools/libxc/xc_domain_movemem.c b/tools/libxc/xc_domain_movemem.c new file mode 100644 --- /dev/null +++ b/tools/libxc/xc_domain_movemem.c @@ -0,0 +1,766 @@ +/****************************************************************************** + * xc_domain_movemem.c + * + * Deallocate and reallocate all the memory of a domain. + * + * Copyright (c) 2013, Dario Faggioli. + * Copyright (c) 2012, Citrix Systems, Inc. + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; + * version 2.1 of the License. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <inttypes.h>
+#include <time.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/time.h>
+#include <xc_core.h>
+
+#include "xc_private.h"
+#include "xc_dom.h"
+#include "xg_private.h"
+#include "xg_save_restore.h"
+
+/* Needed by the translation macros in xg_private.h */
+static struct domain_info_context _dinfo;
+static struct domain_info_context *dinfo = &_dinfo;
+
+#define MAX_BATCH_SIZE 1024
+#define MAX_PIN_BATCH 1024
+
+#define MFN_IS_IN_PSEUDOPHYS_MAP(_mfn, _max_mfn, _minfo, _m2p) \
+    (((_mfn) < (_max_mfn)) && ((mfn_to_pfn(_mfn, _m2p) < (_minfo).p2m_size) && \
+     (pfn_to_mfn(mfn_to_pfn(_mfn, _m2p), (_minfo).p2m_table, \
+                 (_minfo).guest_width) == (_mfn))))
+
+/*
+ * This is to determine which entries in this page table hold reserved
+ * hypervisor mappings. This depends on the current page table type as
+ * well as the number of paging levels (see also xc_domain_save.c).
+ *
+ * XXX: export this function so that it can be used both here and from
+ *      canonicalize_pagetable(), in xc_domain_save.c.
+ */
+static int is_xen_mapping(struct xc_domain_meminfo *minfo, unsigned long type,
+                          unsigned long hvirt_start, unsigned long m2p_mfn0,
+                          const void *spage, int pte)
+{
+    int xen_start, xen_end, pte_last;
+
+    xen_start = xen_end = pte_last = PAGE_SIZE / 8;
+
+    if ( (minfo->pt_levels == 3) && (type == XEN_DOMCTL_PFINFO_L3TAB) )
+        xen_start = L3_PAGETABLE_ENTRIES_PAE;
+
+    /*
+     * In PAE only the L2 mapping the top 1GB contains Xen mappings.
+     * We can spot this by looking for the guest's mapping of the m2p.
+     * Guests must ensure that this check will fail for other L2s.
+     */
+    if ( (minfo->pt_levels == 3) && (type == XEN_DOMCTL_PFINFO_L2TAB) )
+    {
+        int hstart;
+        uint64_t he;
+
+        hstart = (hvirt_start >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff;
+        he = ((const uint64_t *) spage)[hstart];
+
+        if ( ((he >> PAGE_SHIFT) & MFN_MASK_X86) == m2p_mfn0 )
+        {
+            /* hvirt starts with xen stuff... */
+            xen_start = hstart;
+        }
+        else if ( hvirt_start != 0xf5800000 )
+        {
+            /* old L2s from before hole was shrunk... */
+            hstart = (0xf5800000 >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff;
+            he = ((const uint64_t *) spage)[hstart];
+            if ( ((he >> PAGE_SHIFT) & MFN_MASK_X86) == m2p_mfn0 )
+                xen_start = hstart;
+        }
+    }
+
+    if ( (minfo->pt_levels == 4) && (type == XEN_DOMCTL_PFINFO_L4TAB) )
+    {
+        /*
+         * XXX SMH: should compute these from hvirt_start (which we have)
+         *          and hvirt_end (which we don't)
+         */
+        xen_start = 256;
+        xen_end = 272;
+    }
+
+    return pte >= xen_start && pte < xen_end;
+}
+
+/*
+ * This function will basically deallocate _all_ the memory of a domain and
+ * reallocate it immediately. It relies on the guest being suspended
+ * already, before the function is even invoked.
+ *
+ * Of course, it is quite likely that the memory ends up in different places
+ * from where it was before calling this but, for instance, the fact that
+ * this is actually a different NUMA node (or anything else) does not
+ * depend in any way on this function. In fact, here the guest pages are
+ * just freed and immediately re-allocated (you can see it as a very quick,
+ * back-to-back domain_save--domain_restore). If the current domain
+ * configuration says, for instance, that new allocations should go to a
+ * different NUMA node, then the whole domain is moved there, but again,
+ * this is not something this function does explicitly.
+ *
+ * If actually interested in doing something like that (i.e., moving the
+ * domain to a different NUMA node), calling xc_domain_node_setaffinity()
+ * right before this should achieve it.
+ */
+int xc_domain_move_memory(xc_interface *xch, uint32_t domid/*, int hvm*/)
+{
+    unsigned int i, j;
+    int rc = 1;
+
+    xc_dominfo_t info;
+    struct xc_domain_meminfo minfo;
+
+    struct mmuext_op pin[MAX_PIN_BATCH];
+    unsigned int nr_pins;
+
+    struct xc_mmu *mmu = NULL;
+    unsigned int xen_pt_levels, dom_guest_width;
+    unsigned long max_mfn, hvirt_start, m2p_mfn0;
+    vcpu_guest_context_any_t ctxt;
+
+    void *live_p2m_frame_list_list = NULL;
+    void *live_p2m_frame_list = NULL;
+
+    /*
+     * XXX: grant tables & granted pages need to be considered, e.g.,
+     *      using xc_is_page_granted_vX() in xc_offline_page.c to
+     *      recognise them, etc.
+    int gnt_num;
+    grant_entry_v1_t *gnttab_v1 = NULL;
+    grant_entry_v2_t *gnttab_v2 = NULL;
+    */
+
+    void *old_p, *new_p, *backup = NULL;
+    unsigned long mfn, pfn;
+    uint64_t fll;
+
+    xen_pfn_t *new_mfns = NULL, *old_mfns = NULL, *batch_pfns = NULL;
+    int pte_num = PAGE_SIZE / 8, cleared_pte = 0;
+    xen_pfn_t *m2p_table, *orig_m2p = NULL;
+    shared_info_any_t *live_shinfo = NULL;
+
+    unsigned long n = 0, n_skip = 0;
+
+    int debug = 0; /* XXX will become a parameter */
+
+    if ( !get_platform_info(xch, domid, &max_mfn, &hvirt_start,
+                            &xen_pt_levels, &dom_guest_width) )
+    {
+        ERROR("Failed getting platform info");
+        return 1;
+    }
+
+    /* We expect the domain to be suspended already */
+    if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 )
+    {
+        PERROR("Failed getting domain info");
+        return 1;
+    }
+    if ( !info.shutdown || info.shutdown_reason != SHUTDOWN_suspend)
+    {
+        PERROR("Domain appears not to be suspended");
+        return 1;
+    }
+
+    DBGPRINTF("Establishing the mappings for M2P and P2M");
+    memset(&minfo, 0, sizeof(minfo));
+    if ( !(m2p_table = xc_map_m2p(xch, max_mfn, PROT_READ, &m2p_mfn0)) )
+    {
+        PERROR("Failed to map the M2P table");
+        return 1;
+    }
+    if ( xc_map_domain_meminfo(xch, domid, &minfo) )
+    {
+        PERROR("Failed to map domain's memory information");
+        goto out;
+    }
+    dinfo->guest_width = minfo.guest_width;
+    dinfo->p2m_size = minfo.p2m_size;
+
+    /*
+     * XXX
+    DBGPRINTF("Mapping the grant tables");
+    gnttab_v2 = xc_gnttab_map_table_v2(xch, domid, &gnt_num);
+    if (!gnttab_v2)
+    {
+        PERROR("Failed to map V2 grant table... Trying V1");
+        gnttab_v1 = xc_gnttab_map_table_v1(xch, domid, &gnt_num);
+        if (!gnttab_v1)
+        {
+            PERROR("Failed to map grant table");
+            goto out;
+        }
+    }
+    DBGPRINTF("Grant table mapped. %d grants found", gnt_num);
+    */
+
+    mmu = xc_alloc_mmu_updates(xch, (domid+1)<<16|domid);
+    if ( mmu == NULL )
+    {
+        PERROR("Failed to allocate memory for MMU updates");
+        goto out;
+    }
+
+    /* Alloc support data structures */
+    new_mfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t));
+    old_mfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t));
+    batch_pfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t));
+
+    backup = malloc(PAGE_SIZE * MAX_BATCH_SIZE);
+
+    orig_m2p = calloc(max_mfn, sizeof(xen_pfn_t));
+
+    if ( !new_mfns || !old_mfns || !batch_pfns || !backup || !orig_m2p )
+    {
+        ERROR("Failed to allocate copying and/or backup data structures");
+        goto out;
+    }
+
+    DBGPRINTF("Saving the original M2P");
+    memcpy(orig_m2p, m2p_table, max_mfn * sizeof(xen_pfn_t));
+
+    DBGPRINTF("Starting deallocating and reallocating all memory for domain %d"
+              "\n\tnr_pages=%lu, nr_shared_pages=%lu, nr_paged_pages=%lu"
+              "\n\tnr_online_vcpus=%u, max_vcpu_id=%u",
+              domid, info.nr_pages, info.nr_shared_pages, info.nr_paged_pages,
+              info.nr_online_vcpus, info.max_vcpu_id);
+
+    /* Beware: no going back from this point!! */
+
+    /*
+     * As a part of the process of dropping all the references to the existing
+     * pages in memory, so that we can free (and then re-allocate) them, we
+     * need to unpin them.
+     *
+     * We do that in batches of 1024 PFNs at each step, to amortize the cost
+     * of xc_mmuext_op() calls.
+     */
+    nr_pins = 0;
+    for ( i = 0; i < minfo.p2m_size; i++ )
+    {
+        if ( (minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        pin[nr_pins].cmd = MMUEXT_UNPIN_TABLE;
+        pin[nr_pins].arg1.mfn = minfo.p2m_table[i];
+        nr_pins++;
+
+        if ( nr_pins == MAX_PIN_BATCH )
+        {
+            if ( xc_mmuext_op(xch, pin, nr_pins, domid) < 0 )
+            {
+                PERROR("Failed to unpin a batch of %d MFNs", nr_pins);
+                goto out;
+            }
+            else
+                DBGPRINTF("Unpinned a batch of %d MFNs", nr_pins);
+            nr_pins = 0;
+        }
+    }
+    if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, domid) < 0) )
+    {
+        PERROR("Failed to unpin a batch of %d MFNs", nr_pins);
+        goto out;
+    }
+    else
+        DBGPRINTF("Unpinned a batch of %d MFNs", nr_pins);
+
+    /*
+     * After unpinning, we also need to remove the _PAGE_PRESENT bit from
+     * the domain's PTEs, for the pages that we want to deallocate, or they
+     * just could not go away.
+     */
+    for (i = 0; i < minfo.p2m_size; i++)
+    {
+        void *content;
+        xen_pfn_t table_type, table_mfn = pfn_to_mfn(i, minfo.p2m_table,
+                                                     minfo.guest_width);
+
+        if ( table_mfn == INVALID_P2M_ENTRY ||
+             minfo.pfn_type[i] == XEN_DOMCTL_PFINFO_XTAB )
+        {
+            DBGPRINTF("Broken P2M entry at PFN 0x%x", i);
+            continue;
+        }
+
+        table_type = minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+        if ( table_type < XEN_DOMCTL_PFINFO_L1TAB ||
+             table_type > XEN_DOMCTL_PFINFO_L4TAB )
+            continue;
+
+        content = xc_map_foreign_range(xch, domid, PAGE_SIZE,
+                                       PROT_READ, table_mfn);
+        if ( !content )
+        {
+            PERROR("Failed to map the table at MFN 0x%lx", table_mfn);
+            goto out;
+        }
+
+        /* Go through each PTE of each table and clear the _PAGE_PRESENT bit */
+        for ( j = 0; j < pte_num; j++ )
+        {
+            uint64_t pte = ((uint64_t *)content)[j];
+
+            if ( !pte || is_xen_mapping(&minfo, table_type, hvirt_start, m2p_mfn0, content, j) )
+                continue;
+
+            if ( debug )
+                DBGPRINTF("Entry %d: PTE=0x%lx, MFN=0x%lx, PFN=0x%lx", j, pte,
+                          (uint64_t)((pte & MADDR_MASK_X86)>>PAGE_SHIFT),
+                          m2p_table[(unsigned long)((pte & MADDR_MASK_X86)
+                                    >>PAGE_SHIFT)]);
+
+            pfn = m2p_table[(pte & MADDR_MASK_X86)>>PAGE_SHIFT];
+            pte &= ~_PAGE_PRESENT;
+
+            if ( xc_add_mmu_update(xch, mmu, table_mfn << PAGE_SHIFT |
+                                   (j * (sizeof(uint64_t))) |
+                                   MMU_PT_UPDATE_PRESERVE_AD, pte) )
+                PERROR("Failed to add some PTE update operation");
+            else
+                cleared_pte++;
+        }
+
+        if (content)
+            munmap(content, PAGE_SIZE);
+    }
+    if ( cleared_pte && xc_flush_mmu_updates(xch, mmu) )
+    {
+        PERROR("Failed flushing some PTE update operations");
+        goto out;
+    }
+    else
+        DBGPRINTF("Cleared presence for %d PTEs", cleared_pte);
+
+    /* Scan all the P2M ... */
+    while ( n < minfo.p2m_size )
+    {
+        /* ... But all operations are done in batches */
+        for ( i = 0; (i < MAX_BATCH_SIZE) && (n < minfo.p2m_size); n++ )
+        {
+            xen_pfn_t mfn = pfn_to_mfn(n, minfo.p2m_table, minfo.guest_width);
+            xen_pfn_t mfn_type = minfo.pfn_type[n] & XEN_DOMCTL_PFINFO_LTAB_MASK;
+
+            if (mfn == INVALID_P2M_ENTRY || !is_mapped(mfn) )
+            {
+                if ( debug )
+                    DBGPRINTF("Skipping invalid or unmapped MFN 0x%lx", mfn);
+                n_skip++;
+                continue;
+            }
+            if ( mfn_type == XEN_DOMCTL_PFINFO_BROKEN ||
+                 mfn_type == XEN_DOMCTL_PFINFO_XTAB ||
+                 mfn_type == XEN_DOMCTL_PFINFO_XALLOC )
+            {
+                if ( debug )
+                    DBGPRINTF("Skipping broken or alloc only MFN 0x%lx", mfn);
+                n_skip++;
+                continue;
+            }
+
+            /*
+            if ( gnttab_v1 ?
+                 xc_is_page_granted_v1(xch, mfn, gnttab_v1, gnt_num) :
+                 xc_is_page_granted_v2(xch, mfn, gnttab_v2, gnt_num) )
+            {
+                n_skip++;
+                continue;
+            }
+            */
+
+            old_mfns[i] = mfn;
+            batch_pfns[i] = n;
+            i++;
+        }
+
+        /* Was the batch empty? */
+        if ( i == 0)
+            continue;
+
+        /*
+         * And now the core of the whole thing: map the PFNs in the batch,
+         * back them up, allocate new pages for them, and copy them there.
+         * We do this in this order, and we pass through a local backup,
+         * because we don't want to risk hitting the max_mem limit for
+         * the domain (which would be possible, depending on MAX_BATCH_SIZE,
+         * if we try to do it like allocate->copy->deallocate).
+         *
+         * With MAX_BATCH_SIZE of 1024 and 4K pages, this means we are moving
+         * 4MB of guest memory for each batch.
+         */
+
+        /* Map and backup */
+        old_p = xc_map_foreign_pages(xch, domid, PROT_READ, old_mfns, i);
+        if ( !old_p )
+        {
+            PERROR("Failed mapping the current MFNs\n");
+            goto out;
+        }
+        memcpy(backup, old_p, PAGE_SIZE * i);
+        munmap(old_p, PAGE_SIZE * i);
+
+        /* Deallocation and re-allocation */
+        if ( xc_domain_decrease_reservation(xch, domid, i, 0, old_mfns) != i ||
+             xc_domain_populate_physmap_exact(xch, domid, i, 0, 0, new_mfns) )
+        {
+            PERROR("Failed making space or allocating the new MFNs\n");
+            goto out;
+        }
+
+        /* Map the new pages, copy the content and unmap */
+        new_p = xc_map_foreign_pages(xch, domid, PROT_WRITE, new_mfns, i);
+        if ( !new_p )
+        {
+            PERROR("Failed mapping the new MFNs\n");
+            goto out;
+        }
+        memcpy(new_p, backup, PAGE_SIZE * i);
+        munmap(new_p, PAGE_SIZE * i);
+
+        /*
+         * Since we already have the new MFNs, we can update both the M2P
+         * and the P2M right here, within this same loop.
+         */
+        for ( j = 0; j < i; j++ )
+        {
+            minfo.p2m_table[batch_pfns[j]] = new_mfns[j];
+            if ( xc_add_mmu_update(xch, mmu,
+                                   (((uint64_t)new_mfns[j]) << PAGE_SHIFT) |
+                                   MMU_MACHPHYS_UPDATE, batch_pfns[j]) )
+            {
+                PERROR("Failed updating M2P\n");
+                goto out;
+            }
+        }
+        if ( xc_flush_mmu_updates(xch, mmu) )
+        {
+            PERROR("Failed updating M2P\n");
+            goto out;
+        }
+
+        DBGPRINTF("Batch %lu/%ld done (%lu pages skipped)",
+                  n / MAX_BATCH_SIZE, minfo.p2m_size / MAX_BATCH_SIZE, n_skip);
+    }
+
+    /*
+     * Finally (oh, well...) update the PTEs of the domain again, putting
+     * the new MFNs there, and making the entries _PAGE_PRESENT again.
+     *
+     * This is a kind of uncanonicalization, like the one happening in
+     * save-restore, although a very special one, and we rely on the
+     * snapshot of the M2P we made before starting all the
+     * deallocation/reallocation process.
+     */
+    for ( i = 0; i < minfo.p2m_size; i++ )
+    {
+        void *content;
+        xen_pfn_t table_type, table_mfn = pfn_to_mfn(i, minfo.p2m_table,
+                                                     minfo.guest_width);
+
+        table_type = minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
+        if ( table_type < XEN_DOMCTL_PFINFO_L1TAB ||
+             table_type > XEN_DOMCTL_PFINFO_L4TAB )
+            continue;
+
+        /* We of course only care about tables */
+        content = xc_map_foreign_range(xch, domid, PAGE_SIZE,
+                                       PROT_WRITE, table_mfn);
+        if ( !content )
+        {
+            PERROR("Failed to map the table at MFN 0x%lx", table_mfn);
+            continue;
+        }
+
+        for ( j = 0; j < PAGE_SIZE / 8; j++ )
+        {
+            uint64_t pte = ((uint64_t *)content)[j];
+
+            if ( !pte || is_xen_mapping(&minfo, table_type, hvirt_start, m2p_mfn0, content, j) )
+                continue;
+
+            /*
+             * Basically, we lookup the PFN from the snapshotted M2P and we
+             * pick up the new MFN from the P2M (since we updated it "live"
+             * during the re-allocation phase above).
+             */
+            mfn = (pte >> PAGE_SHIFT) & MFN_MASK_X86;
+            pfn = orig_m2p[mfn];
+
+            if ( debug )
+                DBGPRINTF("Table[PTE]: 0x%lx[%d] ==> orig_m2p[0x%lx]=0x%lx, "
+                          "p2m[0x%lx]=0x%lx // pte: 0x%lx --> 0x%lx",
+                          table_mfn, j, mfn, pfn, pfn, minfo.p2m_table[pfn],
+                          pte, (uint64_t)((pte & ~MADDR_MASK_X86)|
+                                          (minfo.p2m_table[pfn]<<PAGE_SHIFT)|
+                                          _PAGE_PRESENT));
+
+            mfn = minfo.p2m_table[pfn];
+            pte &= ~MADDR_MASK_X86;
+            pte |= (uint64_t)mfn << PAGE_SHIFT;
+            pte |= _PAGE_PRESENT;
+
+            ((uint64_t *)content)[j] = pte;
+
+            if ( !MFN_IS_IN_PSEUDOPHYS_MAP(mfn, max_mfn, minfo, m2p_table) )
+            {
+                ERROR("Failed updating entry %d in table at MFN 0x%lx", j, table_mfn);
+                continue; // XXX
+            }
+        }
+
+        if ( content )
+            munmap(content, PAGE_SIZE);
+    }
+
+    DBGPRINTF("Re-pinning page table MFNs");
+
+    /* Pin the table types again */
+    nr_pins = 0;
+    for ( i = 0; i < minfo.p2m_size; i++ )
+    {
+        if ( (minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        switch ( minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE;
+            break;
+        default:
+            continue;
+        }
+        pin[nr_pins].arg1.mfn = minfo.p2m_table[i];
+        nr_pins++;
+
+        if ( nr_pins == MAX_PIN_BATCH )
+        {
+            if ( xc_mmuext_op(xch, pin, nr_pins, domid) < 0 )
+            {
+                PERROR("Failed to pin a batch of %d MFNs", nr_pins);
+                goto out;
+            }
+            else
+                DBGPRINTF("Re-pinned a batch of %d MFNs", nr_pins);
+            nr_pins = 0;
+        }
+    }
+    if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, domid) < 0) )
+    {
+        PERROR("Failed to pin batch of %d page tables", nr_pins);
+        goto out;
+    }
+    else
+        DBGPRINTF("Re-pinned a batch of %d MFNs", nr_pins);
+
+    /*
+     * Now, take care of the vCPU contexts. It all happens as above:
+     * we use the original M2P and the new domain's P2M to update all
+     * the various references.
+     */
+    for ( i = 0; i <= info.max_vcpu_id; i++ )
+    {
+        xc_vcpuinfo_t vinfo;
+
+        DBGPRINTF("Adjusting context for VCPU%d", i);
+
+        if ( xc_vcpu_getinfo(xch, domid, i, &vinfo) )
+        {
+            PERROR("Failed getting info for VCPU%d", i);
+            goto out;
+        }
+        if ( !vinfo.online )
+        {
+            DBGPRINTF("VCPU%d seems offline", i);
+            continue;
+        }
+
+        if ( xc_vcpu_getcontext(xch, domid, i, &ctxt) )
+        {
+            PERROR("No context for VCPU%d", i);
+            goto out;
+        }
+
+        if ( i == 0 )
+        {
+            //start_info_any_t *start_info;
+
+            /*
+             * Update the start info frame number. It is the 3rd argument
+             * to the HYPERVISOR_sched_op hypercall when op is
+             * SCHEDOP_shutdown and reason is SHUTDOWN_suspend, so we find
+             * it in EDX.
+             */
+            mfn = GET_FIELD(&ctxt, user_regs.edx);
+            mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+            SET_FIELD(&ctxt, user_regs.edx, mfn);
+
+            /*
+             * XXX: I checked, and store_mfn and console_mfn seemed ok, at
+             *      least from a 'mapping' point of view, but more testing
+             *      is needed.
+            start_info = xc_map_foreign_range(xch, domid, PAGE_SIZE, PROT_READ | PROT_WRITE, mfn);
+            munmap(start_info, PAGE_SIZE);
+            */
+        }
+
+        /* GDT pointing MFNs */
+        for ( j = 0; (512*j) < GET_FIELD(&ctxt, gdt_ents); j++ )
+        {
+            mfn = GET_FIELD(&ctxt, gdt_frames[j]);
+            mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+            SET_FIELD(&ctxt, gdt_frames[j], mfn);
+        }
+
+        /* CR3 XXX: PAE needs special attention here, I think */
+        mfn = UNFOLD_CR3(GET_FIELD(&ctxt, ctrlreg[3]));
+        mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+        SET_FIELD(&ctxt, ctrlreg[3], FOLD_CR3(mfn));
+
+        /* Guest pagetable (x86/64) in CR1 */
+        if ( (minfo.pt_levels == 4) && ctxt.x64.ctrlreg[1] )
+        {
+            /*
+             * XXX: save-restore code mangles with the least-significant
+             *      bit ('valid PFN'). This should not be needed in here.
+             */
+            mfn = UNFOLD_CR3(ctxt.x64.ctrlreg[1]);
+            mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+            ctxt.x64.ctrlreg[1] = FOLD_CR3(mfn);
+        }
+
+        /*
+         * XXX: Xen refuses to set a new context for an existing vCPU if
+         *      things like CR3 or the GDTs have changed, even if the domain
+         *      is suspended. Going through re-initializing the vCPU (by
+         *      the one call below with a NULL ctxt) makes it possible,
+         *      but is that sensible? And even if yes, is the following
+         *      _setcontext call enough?
+         */
+        if ( xc_vcpu_setcontext(xch, domid, i, NULL) )
+        {
+            PERROR("Failed re-initialising VCPU%d", i);
+            goto out;
+        }
+        if ( xc_vcpu_setcontext(xch, domid, i, &ctxt) )
+        {
+            PERROR("Failed when updating context for VCPU%d", i);
+            goto out;
+        }
+    }
+
+    /*
+     * Finally (and this time for real), we take care of the pages mapping
+     * the P2M, and of the P2M entries themselves.
+     */
+
+    live_shinfo = xc_map_foreign_range(xch, domid,
+        PAGE_SIZE, PROT_READ|PROT_WRITE, info.shared_info_frame);
+    if ( !live_shinfo )
+    {
+        PERROR("Failed mapping live_shinfo");
+        goto out;
+    }
+
+    fll = GET_FIELD(live_shinfo, arch.pfn_to_mfn_frame_list_list);
+    fll = minfo.p2m_table[mfn_to_pfn(fll, orig_m2p)];
+    live_p2m_frame_list_list = xc_map_foreign_range(xch, domid, PAGE_SIZE,
+                                                    PROT_READ|PROT_WRITE, fll);
+    if ( !live_p2m_frame_list_list )
+    {
+        PERROR("Couldn't map live_p2m_frame_list_list");
+        goto out;
+    }
+    SET_FIELD(live_shinfo, arch.pfn_to_mfn_frame_list_list, fll);
+
+    /* First, update the frames containing the list of the P2M frames */
+    for ( i = 0; i < P2M_FLL_ENTRIES; i++ )
+    {
+        mfn = ((uint64_t *)live_p2m_frame_list_list)[i];
+        mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+        ((uint64_t *)live_p2m_frame_list_list)[i] = mfn;
+    }
+
+    live_p2m_frame_list =
+        xc_map_foreign_pages(xch, domid, PROT_READ|PROT_WRITE,
+                             live_p2m_frame_list_list,
+                             P2M_FLL_ENTRIES);
+    if ( !live_p2m_frame_list )
+    {
+        PERROR("Couldn't map live_p2m_frame_list");
+        goto out;
+    }
+
+    /* And then update the actual entries of it */
+    for ( i = 0; i < P2M_FL_ENTRIES; i++ )
+    {
+        mfn = ((uint64_t *)live_p2m_frame_list)[i];
+        mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)];
+        ((uint64_t *)live_p2m_frame_list)[i] = mfn;
+    }
+
+    rc = 0;
+
+ out:
+    if ( live_p2m_frame_list_list )
+        munmap(live_p2m_frame_list_list, PAGE_SIZE);
+    if ( live_p2m_frame_list )
+        munmap(live_p2m_frame_list, P2M_FLL_ENTRIES * PAGE_SIZE);
+    if ( live_shinfo )
+        munmap(live_shinfo, PAGE_SIZE);
+
+    free(mmu);
+    free(new_mfns);
+    free(old_mfns);
+    free(batch_pfns);
+    free(backup);
+    free(orig_m2p);
+
+    /*
+    if (gnttab_v1)
+        munmap(gnttab_v1, gnt_num / (PAGE_SIZE/sizeof(grant_entry_v1_t)));
+    if (gnttab_v2)
+        munmap(gnttab_v2, gnt_num / (PAGE_SIZE/sizeof(grant_entry_v2_t)));
+    */
+
+    xc_unmap_domain_meminfo(xch, &minfo);
+    munmap(m2p_table, M2P_SIZE(max_mfn));
+
+    return !!rc;
+}
diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
--- a/tools/libxc/xenguest.h
+++ b/tools/libxc/xenguest.h
@@ -272,6 +272,15 @@ int xc_query_page_offline_status(xc_inte
 int xc_exchange_page(xc_interface *xch, int domid, xen_pfn_t mfn);
 
+/**
+ * This function deallocates all the guest's memory and allocates it
+ * again immediately, with the net effect of moving it somewhere
+ * else wrt where it is when the function is invoked.
+ *
+ * @param xch a handle to an open hypervisor interface.
+ * @param domid the domain id one wants to move the memory of.
+ */
+int xc_domain_move_memory(xc_interface *xch, uint32_t domid/*, int hvm*/);
 
 /**
  * Memory related information, such as PFN types, the P2M table,
diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h
--- a/tools/libxc/xg_private.h
+++ b/tools/libxc/xg_private.h
@@ -145,6 +145,11 @@ static inline xen_pfn_t pfn_to_mfn(xen_p
             (((uint32_t *)p2m)[(pfn)]))));
 }
 
+static inline xen_pfn_t mfn_to_pfn(xen_pfn_t mfn, xen_pfn_t *m2p)
+{
+    return m2p[mfn];
+}
+
 /* Number of xen_pfn_t in a page */
 #define FPP (PAGE_SIZE/(dinfo->guest_width))
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 7 of 8 [RFC]] libxl: introduce libxl_domain_move_memory
and make use of it in `xl node-affinity' (introduced by one of the previous patches). This way, users of both the library and the command line can ask for deallocation and reallocation of all the memory pages belonging to a given domain.

In `xl node-affinity', if the '-M' option is used, the above happens right after changing the NUMA node affinity of the domain itself. This moves the domain from the (set of) NUMA node(s) where it currently resides to somewhere else.

In libxl, what was needed was a way of requesting domain suspension without triggering the whole save/send machinery, and avoiding its (potential) asynchronicity. Perhaps, in future, and if we want that, the _whole_ suspend+move operation can be made asynchronous, but not the single pieces of it. That is achieved by calling the core of the suspend routine directly, with a simplified save context. Of course, this brings no change to the actual suspend/resume and save/restore behaviour, wrt the current one.

XXX: Update man pages and documentation still to be done.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -853,6 +853,81 @@ int libxl_domain_unpause(libxl_ctx *ctx,
     return rc;
 }
 
+/*
+ * This function will suspend the domain and invoke xc_domain_move_memory().
+ * The xc_ call will deallocate and reallocate all the memory of the domain.
+ * Notice that this function _does_not_ resume the domain on its own, and
+ * that needs to be done manually by the caller.
+ *
+ * This means that, if the allocation policy (e.g., the node affinity)
+ * for the domain changed (right) before calling this function, at the
+ * end of it all the domain's memory will be compliant with that policy.
+ */
+int libxl_move_memory(libxl_ctx *ctx, uint32_t domid)
+{
+    GC_INIT(ctx);
+    libxl__domain_suspend_state *dss;
+    int rc = 0;
+
+    libxl_domain_type type = libxl__domain_type(gc, domid);
+    if (type == LIBXL_DOMAIN_TYPE_INVALID)
+        abort();
+
+    /*
+     * First of all, we need to suspend the domain. We use the core
+     * suspend code from libxl_dom.c, calling it directly instead of
+     * having libxc calling back into it. For that reason, we need a dss,
+     * although only some of the fields are relevant in our case.
+     */
+    GCNEW(dss);
+    dss->domid = domid;
+    dss->hvm = type == LIBXL_DOMAIN_TYPE_HVM;
+    dss->suspend_eventchn = -1;
+    dss->guest_responded = 0;
+    dss->dm_savefile = NULL;
+
+    /* Try to initialize the suspend event channel */
+    dss->xce = xc_evtchn_open(NULL, 0);
+    if (dss->xce == NULL) {
+        rc = ERROR_FAIL;
+        goto out;
+    } else {
+        int port = xs_suspend_evtchn_port(domid);
+
+        if (port >= 0) {
+            dss->suspend_eventchn =
+                xc_suspend_evtchn_init(ctx->xch, dss->xce, domid, port);
+
+            if (dss->suspend_eventchn < 0)
+                LOG(WARN, "Suspend event channel initialization failed");
+        }
+    }
+
+    if (!libxl__do_domain_suspend(gc, dss))
+    {
+        rc = ERROR_GUEST_TIMEDOUT;
+        goto out;
+    }
+
+    rc = xc_domain_move_memory(ctx->xch, domid);
+    if (rc) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "moving memory");
+        rc = ERROR_BADFAIL;
+        goto out;
+    }
+
+ out:
+    /* Tear down the suspend event channel (if successfully initialized) */
+    if (dss->suspend_eventchn > 0)
+        xc_suspend_evtchn_release(CTX->xch, dss->xce, domid,
+                                  dss->suspend_eventchn);
+    if (dss->xce != NULL)
+        xc_evtchn_close(dss->xce);
+
+    GC_FREE;
+    return rc;
+}
+
 int libxl__domain_pvcontrol_available(libxl__gc *gc, uint32_t domid)
 {
     libxl_ctx *ctx = libxl__gc_owner(gc);
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -950,6 +950,8 @@ int libxl_flask_getenforce(libxl_ctx *ct
 int libxl_flask_setenforce(libxl_ctx *ctx, int mode);
 int libxl_flask_loadpolicy(libxl_ctx *ctx,
void *policy, uint32_t size); +int libxl_move_memory(libxl_ctx *ctx, uint32_t domid); + /* misc */ /* Each of these sets or clears the flag according to whether the diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -991,11 +991,8 @@ int libxl__domain_resume_device_model(li return 0; } -int libxl__domain_suspend_common_callback(void *user) +int libxl__do_domain_suspend(libxl__gc *gc, libxl__domain_suspend_state *dss) { - libxl__save_helper_state *shs = user; - libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); - STATE_AO_GC(dss->ao); unsigned long hvm_s_state = 0, hvm_pvdrv = 0; int ret; char *state = "suspend"; @@ -1125,6 +1122,16 @@ int libxl__domain_suspend_common_callbac return 1; } + +int libxl__domain_suspend_common_callback(void *user) +{ + libxl__save_helper_state *shs = user; + libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); + STATE_AO_GC(dss->ao); + + return libxl__do_domain_suspend(gc, dss); +} + static inline char *physmap_path(libxl__gc *gc, uint32_t domid, char *phys_offset, char *node) { diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2541,6 +2541,9 @@ struct libxl__domain_create_state { void libxl__xc_domain_saverestore_async_callback_done(libxl__egc *egc, libxl__save_helper_state *shs, int return_value); +_hidden int libxl__do_domain_suspend(libxl__gc *gc, + libxl__domain_suspend_state *dss); + _hidden int libxl__domain_suspend_common_callback(void *data); _hidden void libxl__domain_suspend_common_switch_qemu_logdirty (int domid, unsigned int enable, void *data); diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -4654,13 +4654,31 @@ static void nodeaffinity(uint32_t domid, int main_nodeaffinity(int argc, char **argv) { - int opt; - - SWITCH_FOREACH_OPT(opt, "", NULL, 
"node-affinity", 2) {
-        /* No options */
+    int movemem = 0;
+    int opt = 0;
+
+    SWITCH_FOREACH_OPT(opt, "M", NULL, "node-affinity", 2) {
+    case 'M':
+        movemem = 1;
+        break;
+    }
+
+    if (argc - optind > 3) {
+        help("nodeaffinity");
+        return 2;
     }
 
     nodeaffinity(find_domain(argv[optind]), argv[optind+1]);
+
+    if (movemem) {
+        int rc = libxl_move_memory(ctx, find_domain(argv[optind]));
+
+        if (rc < 0)
+            fprintf(stderr, "Failed to move memory, trying to resume\n");
+
+        libxl_domain_resume(ctx, find_domain(argv[optind]), 1, NULL);
+    }
+
     return 0;
 }
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -217,7 +217,10 @@ struct cmd_spec cmd_table[] = {
     { "node-affinity", &main_nodeaffinity, 0, 1,
       "Set the NUMA node affinity for the domain",
-      "<Domain> [<NODEs|all|none>]",
+      "[options] <Domain> [<NODEs|all|none>]",
+      "-M  Move the memory to the nodes corresponding to CPUs\n"
+      "    (involves suspending/resuming the domain, so some\n"
+      "    downtime is to be expected)",
     },
     { "vcpu-set", &main_vcpuset, 0, 1,
Dario Faggioli
2013-Apr-09 02:49 UTC
[PATCH 8 of 8 [RFC]] tools/misc: introduce xen-mfndump
It's a little tool, useful when trying to figure out what goes on in both the host's and the guests' memory, i.e., stuff like MFN to PFN mappings, MFN/PFN mappings in a guest's PTEs, etc. This is what it does as of now:

 $ /usr/sbin/xen-mfndump
 Usage: xen-mfndump <command> [args]
 Commands:
     help                     show this help
     dump-m2p                 show M2P
     dump-p2m <domid>         show P2M of <domid>
     dump-ptes <domid> <mfn>  show the PTEs in <mfn>
     lookup-pte <domid> <mfn> find the PTE mapping <mfn>
     memcmp-mfns <domid1> <mfn1> <domid2> <mfn2>
                              (str)compare content of <mfn1> & <mfn2>

It's probably far from perfect, but it proves quite useful when debugging the kind of issues introduced by this series.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/misc/Makefile b/tools/misc/Makefile
--- a/tools/misc/Makefile
+++ b/tools/misc/Makefile
@@ -10,7 +10,7 @@ CFLAGS += $(CFLAGS_libxenstore)
 HDRS = $(wildcard *.h)
 
 TARGETS-y := xenperf xenpm xen-tmem-list-parse gtraceview gtracestat xenlockprof xenwatchdogd xencov
-TARGETS-$(CONFIG_X86) += xen-detect xen-hvmctx xen-hvmcrash xen-lowmemd
+TARGETS-$(CONFIG_X86) += xen-detect xen-hvmctx xen-hvmcrash xen-lowmemd xen-mfndump
 TARGETS-$(CONFIG_MIGRATE) += xen-hptool
 TARGETS := $(TARGETS-y)
@@ -24,7 +24,7 @@ INSTALL_BIN := $(INSTALL_BIN-y)
 INSTALL_SBIN-y := xm xen-bugtool xen-python-path xend xenperf xsview xenpm xen-tmem-list-parse gtraceview \
 	gtracestat xenlockprof xenwatchdogd xen-ringwatch xencov
-INSTALL_SBIN-$(CONFIG_X86) += xen-hvmctx xen-hvmcrash xen-lowmemd
+INSTALL_SBIN-$(CONFIG_X86) += xen-hvmctx xen-hvmcrash xen-lowmemd xen-mfndump
 INSTALL_SBIN-$(CONFIG_MIGRATE) += xen-hptool
 INSTALL_SBIN := $(INSTALL_SBIN-y)
@@ -77,6 +77,9 @@ xenlockprof: xenlockprof.o
 xen-hptool: xen-hptool.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS)
 
+xen-mfndump: xen-mfndump.o
+	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS)
+
xenwatchdogd: xenwatchdogd.o $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS) diff --git a/tools/misc/xen-mfndump.c b/tools/misc/xen-mfndump.c new file mode 100644 --- /dev/null +++ b/tools/misc/xen-mfndump.c @@ -0,0 +1,425 @@ +#include <xenctrl.h> +#include <xc_private.h> +#include <xc_core.h> +#include <errno.h> +#include <unistd.h> + +#include "xg_save_restore.h" + +#define ARRAY_SIZE(a) (sizeof (a) / sizeof ((a)[0])) + +static xc_interface *xch; + +int help_func(int argc, char *argv[]) +{ + fprintf(stderr, + "Usage: xen-mfndump <command> [args]\n" + "Commands:\n" + " help show this help\n" + " dump-m2p show M2P\n" + " dump-p2m <domid> show P2M of <domid>\n" + " dump-ptes <domid> <mfn> show the PTEs in <mfn>\n" + " lookup-pte <domid> <mfn> find the PTE mapping <mfn>\n" + " memcmp-mfns <domid1> <mfn1> <domid2> <mfn2>\n" + " compare content of <mfn1> & <mfn2>\n" + ); + + return 0; +} + +int dump_m2p_func(int argc, char *argv[]) +{ + unsigned long i, max_mfn; + xen_pfn_t *m2p_table; + + if ( argc > 0 ) + { + help_func(0, NULL); + return 1; + } + + /* Map M2P and obtain gpfn */ + max_mfn = xc_maximum_ram_page(xch); + if ( !(m2p_table = xc_map_m2p(xch, max_mfn, PROT_READ, NULL)) ) + { + ERROR("Failed to map live M2P table"); + return -1; + } + + printf(" --- Dumping M2P ---\n"); + printf(" Max MFN: %lu\n", max_mfn); + for ( i = 0; i < max_mfn; i++ ) + { + printf(" mfn=0x%lx ==> pfn=0x%lx\n", i, m2p_table[i]); + } + printf(" --- End of M2P ---\n"); + + munmap(m2p_table, M2P_SIZE(max_mfn)); + return 0; +} + +int dump_p2m_func(int argc, char *argv[]) +{ + struct xc_domain_meminfo minfo; + xc_dominfo_t info; + unsigned long i; + int domid; + + if ( argc < 1 ) + { + help_func(0, NULL); + return 1; + } + domid = atoi(argv[0]); + + if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 || + info.domid != domid ) + { + ERROR("Failed to obtain info for domain %d\n", domid); + return -1; + } + + /* Retrieve all the info about the domain''s memory */ + memset(&minfo, 
0, sizeof(minfo)); + if ( xc_map_domain_meminfo(xch, domid, &minfo) ) + { + ERROR("Could not map domain %d memory information\n", domid); + return -1; + } + + printf(" --- Dumping P2M for domain %d ---\n", domid); + printf(" Guest Width: %u, PT Levels: %u P2M size: = %lu\n", + minfo.guest_width, minfo.pt_levels, minfo.p2m_size); + for ( i = 0; i < minfo.p2m_size; i++ ) + { + unsigned long pagetype = minfo.pfn_type[i] & + XEN_DOMCTL_PFINFO_LTAB_MASK; + + printf(" pfn=0x%lx ==> mfn=0x%lx (type 0x%lx)", i, minfo.p2m_table[i], + pagetype >> XEN_DOMCTL_PFINFO_LTAB_SHIFT); + + if ( is_mapped(minfo.p2m_table[i]) ) + printf(" [mapped]"); + + if ( pagetype & XEN_DOMCTL_PFINFO_LPINTAB ) + printf (" [pinned]"); + + if ( pagetype == XEN_DOMCTL_PFINFO_XTAB ) + printf(" [xtab]"); + if ( pagetype == XEN_DOMCTL_PFINFO_BROKEN ) + printf(" [broken]"); + if ( pagetype == XEN_DOMCTL_PFINFO_XALLOC ) + printf( " [xalloc]"); + + switch ( pagetype & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) + { + case XEN_DOMCTL_PFINFO_L1TAB: + printf(" L1 table"); + break; + + case XEN_DOMCTL_PFINFO_L2TAB: + printf(" L2 table"); + break; + + case XEN_DOMCTL_PFINFO_L3TAB: + printf(" L3 table"); + break; + + case XEN_DOMCTL_PFINFO_L4TAB: + printf(" L4 table"); + break; + } + + printf("\n"); + } + printf(" --- End of P2M for domain %d ---\n", domid); + + xc_unmap_domain_meminfo(xch, &minfo); + return 0; +} + +int dump_ptes_func(int argc, char *argv[]) +{ + struct xc_domain_meminfo minfo; + xc_dominfo_t info; + void *page = NULL; + unsigned long i, max_mfn; + int domid, pte_num, rc = 0; + xen_pfn_t pfn, mfn, *m2p_table; + + if ( argc < 2 ) + { + help_func(0, NULL); + return 1; + } + domid = atoi(argv[0]); + mfn = strtoul(argv[1], NULL, 16); + + if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 || + info.domid != domid ) + { + ERROR("Failed to obtain info for domain %d\n", domid); + return -1; + } + + /* Retrieve all the info about the domain''s memory */ + memset(&minfo, 0, sizeof(minfo)); + if ( 
xc_map_domain_meminfo(xch, domid, &minfo) ) + { + ERROR("Could not map domain %d memory information\n", domid); + return -1; + } + + /* Map M2P and obtain gpfn */ + max_mfn = xc_maximum_ram_page(xch); + if ( (mfn > max_mfn) || + !(m2p_table = xc_map_m2p(xch, max_mfn, PROT_READ, NULL)) ) + { + xc_unmap_domain_meminfo(xch, &minfo); + ERROR("Failed to map live M2P table"); + return -1; + } + + pfn = m2p_table[mfn]; + if ( pfn >= minfo.p2m_size ) + { + ERROR("pfn 0x%lx out of range for domain %d\n", pfn, domid); + rc = -1; + goto out; + } + + if ( !(minfo.pfn_type[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK) ) + { + ERROR("pfn 0x%lx for domain %d is not a PT\n", pfn, domid); + rc = -1; + goto out; + } + + page = xc_map_foreign_range(xch, domid, PAGE_SIZE, PROT_READ, + minfo.p2m_table[pfn]); + if ( !page ) + { + ERROR("Failed to map 0x%lx\n", minfo.p2m_table[pfn]); + rc = -1; + goto out; + } + + pte_num = PAGE_SIZE / 8; + + printf(" --- Dumping %d PTEs for domain %d ---\n", pte_num, domid); + printf(" Guest Width: %u, PT Levels: %u P2M size: = %lu\n", + minfo.guest_width, minfo.pt_levels, minfo.p2m_size); + printf(" pfn: 0x%lx, mfn: 0x%lx", + pfn, minfo.p2m_table[pfn]); + switch ( minfo.pfn_type[pfn] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) + { + case XEN_DOMCTL_PFINFO_L1TAB: + printf(", L1 table"); + break; + case XEN_DOMCTL_PFINFO_L2TAB: + printf(", L2 table"); + break; + case XEN_DOMCTL_PFINFO_L3TAB: + printf(", L3 table"); + break; + case XEN_DOMCTL_PFINFO_L4TAB: + printf(", L4 table"); + break; + } + if ( minfo.pfn_type[pfn] & XEN_DOMCTL_PFINFO_LPINTAB ) + printf (" [pinned]"); + if ( is_mapped(minfo.p2m_table[pfn]) ) + printf(" [mapped]"); + printf("\n"); + + for ( i = 0; i < pte_num; i++ ) + printf(" pte[%lu]: 0x%lx\n", i, ((const uint64_t*)page)[i]); + + printf(" --- End of PTEs for domain %d, pfn=0x%lx (mfn=0x%lx) ---\n", + domid, pfn, minfo.p2m_table[pfn]); + + out: + munmap(page, PAGE_SIZE); + xc_unmap_domain_meminfo(xch, &minfo); + munmap(m2p_table, 
M2P_SIZE(max_mfn));
+    return rc;
+}
+
+int lookup_pte_func(int argc, char *argv[])
+{
+    struct xc_domain_meminfo minfo;
+    xc_dominfo_t info;
+    void *page = NULL;
+    unsigned long i, j;
+    int domid, pte_num;
+    xen_pfn_t mfn;
+
+    if ( argc < 2 )
+    {
+        help_func(0, NULL);
+        return 1;
+    }
+    domid = atoi(argv[0]);
+    mfn = strtoul(argv[1], NULL, 16);
+
+    if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
+         info.domid != domid )
+    {
+        ERROR("Failed to obtain info for domain %d\n", domid);
+        return -1;
+    }
+
+    /* Retrieve all the info about the domain's memory */
+    memset(&minfo, 0, sizeof(minfo));
+    if ( xc_map_domain_meminfo(xch, domid, &minfo) )
+    {
+        ERROR("Could not map domain %d memory information\n", domid);
+        return -1;
+    }
+
+    pte_num = PAGE_SIZE / 8;
+
+    printf(" --- Looking for PTEs mapping mfn 0x%lx for domain %d ---\n",
+           mfn, domid);
+    printf(" Guest Width: %u, PT Levels: %u P2M size: = %lu\n",
+           minfo.guest_width, minfo.pt_levels, minfo.p2m_size);
+
+    for ( i = 0; i < minfo.p2m_size; i++ )
+    {
+        if ( !(minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK) )
+            continue;
+
+        page = xc_map_foreign_range(xch, domid, PAGE_SIZE, PROT_READ,
+                                    minfo.p2m_table[i]);
+        if ( !page )
+            continue;
+
+        for ( j = 0; j < pte_num; j++ )
+        {
+            uint64_t pte = ((const uint64_t*)page)[j];
+
+#define __MADDR_BITS_X86 ((minfo.guest_width == 8) ?
52 : 44)
+#define __MFN_MASK_X86 ((1ULL << (__MADDR_BITS_X86 - PAGE_SHIFT_X86)) - 1)
+            if ( ((pte >> PAGE_SHIFT_X86) & __MFN_MASK_X86) == mfn)
+                printf(" 0x%lx <-- [0x%lx][%lu]: 0x%lx\n",
+                       mfn, minfo.p2m_table[i], j, pte);
+#undef __MADDR_BITS_X86
+#undef __MFN_MASK_X86
+        }
+
+        munmap(page, PAGE_SIZE);
+        page = NULL;
+    }
+
+    xc_unmap_domain_meminfo(xch, &minfo);
+
+    return 1;
+}
+
+int memcmp_mfns_func(int argc, char *argv[])
+{
+    xc_dominfo_t info1, info2;
+    void *page1 = NULL, *page2 = NULL;
+    int domid1, domid2;
+    xen_pfn_t mfn1, mfn2;
+    int rc = 0;
+
+    if ( argc < 4 )
+    {
+        help_func(0, NULL);
+        return 1;
+    }
+    domid1 = atoi(argv[0]);
+    domid2 = atoi(argv[2]);
+    mfn1 = strtoul(argv[1], NULL, 16);
+    mfn2 = strtoul(argv[3], NULL, 16);
+
+    if ( xc_domain_getinfo(xch, domid1, 1, &info1) != 1 ||
+         xc_domain_getinfo(xch, domid2, 1, &info2) != 1 ||
+         info1.domid != domid1 || info2.domid != domid2)
+    {
+        ERROR("Failed to obtain info for domains\n");
+        return -1;
+    }
+
+    page1 = xc_map_foreign_range(xch, domid1, PAGE_SIZE, PROT_READ, mfn1);
+    page2 = xc_map_foreign_range(xch, domid2, PAGE_SIZE, PROT_READ, mfn2);
+    if ( !page1 || !page2 )
+    {
+        ERROR("Failed to map either 0x%lx[dom %d] or 0x%lx[dom %d]\n",
+              mfn1, domid1, mfn2, domid2);
+        rc = -1;
+        goto out;
+    }
+
+    printf(" --- Comparing the content of 2 MFNs ---\n");
+    printf(" 1: 0x%lx[dom %d], 2: 0x%lx[dom %d]\n",
+           mfn1, domid1, mfn2, domid2);
+    printf(" memcmp(1, 2) = %d\n", memcmp(page1, page2, PAGE_SIZE));
+
+ out:
+    munmap(page1, PAGE_SIZE);
+    munmap(page2, PAGE_SIZE);
+    return rc;
+}
+
+
+
+struct {
+    const char *name;
+    int (*func)(int argc, char *argv[]);
+} opts[] = {
+    { "help", help_func },
+    { "dump-m2p", dump_m2p_func },
+    { "dump-p2m", dump_p2m_func },
+    { "dump-ptes", dump_ptes_func },
+    { "lookup-pte", lookup_pte_func },
+    { "memcmp-mfns", memcmp_mfns_func},
+};
+
+int main(int argc, char *argv[])
+{
+    int i, ret;
+
+    if (argc < 2)
+    {
+        help_func(0, NULL);
+        return 1;
+    }
+
+    xch =
xc_interface_open(0, 0, 0);
+    if ( !xch )
+    {
+        ERROR("Failed to open an xc handle");
+        return 1;
+    }
+
+    for ( i = 0; i < ARRAY_SIZE(opts); i++ )
+    {
+        if ( !strncmp(opts[i].name, argv[1], strlen(argv[1])) )
+            break;
+    }
+
+    if ( i == ARRAY_SIZE(opts) )
+    {
+        fprintf(stderr, "Unknown option '%s'", argv[1]);
+        help_func(0, NULL);
+        return 1;
+    }
+
+    ret = opts[i].func(argc - 2, argv + 2);
+
+    xc_interface_close(xch);
+
+    return !!ret;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-set-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
Juergen Gross
2013-Apr-09 05:23 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On 09.04.2013 04:49, Dario Faggioli wrote:
> as a mechanism of deallocating and reallocating (immediately!) _all_
> the memory of a domain. Notice it relies on the guest being suspended
> already, before the function is invoked.

Is this solution intended to be the final one?

This might be okay for a domain with less than 1GB of memory, but I see problems for really huge domains. The needed time to copy the memory might result in long offline times. For this case something like live migration (optional?) would be a better solution, I think.

Juergen

--
Juergen Gross
Principal Developer Operating Systems
PBG PDG ES&S SWE OS6
Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions
e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28
Internet: ts.fujitsu.com
D-80807 Muenchen
Company details: ts.fujitsu.com/imprint.html
Dario Faggioli
2013-Apr-09 06:56 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On mar, 2013-04-09 at 06:23 +0100, Juergen Gross wrote:
> On 09.04.2013 04:49, Dario Faggioli wrote:
> > as a mechanism of deallocating and reallocating (immediately!) _all_
> > the memory of a domain. Notice it relies on the guest being suspended
> > already, before the function is invoked.
>
> Is this solution intended to be the final one?
>
Well, the idea of sharing the patches, even at this stage, was right about discussing that! :-P

> This might be okay for a domain with less than 1GB of memory, but I see
> problems for really huge domains. The needed time to copy the memory might
> result in long offline times.
>
I see what you mean. I thought about approaches that copy only a specific part of the memory, perhaps according to some usage statistics.

I've not yet abandoned that idea, but I honestly think that, if we go through the suspend-copy-resume (which is pretty much the only thing I can do with PV guests, isn't it?), that can't be for a page or two, or the impact of the overhead would be even higher!

> For this case something like live migration
> (optional?) would be a better solution, I think.
>
Well, I thought about that too, and I'm open to thinking/discussing/hearing suggestions about how to implement a "live phase" for this.

The problem is, with a more migration-alike approach, I'll end up doubling the memory requirements of, potentially, all the domains (since I'd need space for storing the full RAM image of each one!), which I don't think is an acceptable requirement either, _especially_ for big guests, is it? :-(

Thanks for your interest, :-)
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Juergen Gross
2013-Apr-09 08:13 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On 09.04.2013 08:56, Dario Faggioli wrote:
> On mar, 2013-04-09 at 06:23 +0100, Juergen Gross wrote:
>> On 09.04.2013 04:49, Dario Faggioli wrote:
>>> as a mechanism of deallocating and reallocating (immediately!) _all_
>>> the memory of a domain. Notice it relies on the guest being suspended
>>> already, before the function is invoked.
>>
>> Is this solution intended to be the final one?
>>
> Well, the idea of sharing the patches, even at this stage, was right
> about discussing that! :-P
>
>> This might be okay for a domain with less than 1GB of memory, but I see
>> problems for really huge domains. The needed time to copy the memory might
>> result in long offline times.
>>
> I see what you mean. I thought about approaches that copy only a
> specific part of the memory, perhaps according to some usage statistics.
>
> I've not yet abandoned that idea, but I honestly think that, if we go
> through the suspend-copy-resume (which is pretty much the only thing I
> can do with PV guests, isn't it?), that can't be for a page or two, or
> the impact of the overhead would be even higher!
>
>> For this case something like live migration
>> (optional?) would be a better solution, I think.
>>
> Well, I thought about that too, and I'm open to
> thinking/discussing/hearing suggestions about how to implement a "live
> phase" for this.
>
> The problem is, with a more migration-alike approach, I'll end up
> doubling the memory requirements of, potentially, all the domains (since
> I'd need space for storing the full RAM image of each one!), which I
> don't think is an acceptable requirement either, _especially_ for big
> guests, is it? :-(

What about the following approach:

- do the migration in chunks (like 1GB, may be configurable)
- don't move pages which are already on one of the target nodes
- try to allocate memory on the target node while the domain is still running.
  If this fails, there is no need to move that chunk. Depending on the page
  size requirements (huge pages) decide whether the move is aborted or done
  partially.
- in case of successful allocation suspend the domain, do the copy and update
  page tables for the copied pages, then resume the domain
- free the memory chunk on the old node(s)
- repeat until either no memory obtained or move is finished

This will have higher overhead, but the domain will be suspended for only short periods of time. The memory requirements don't matter, as the additional memory will be allocated only for a short period of time.

Additionally this approach is more secure, as the domain can't end up in a suspended state without memory (you don't have to avoid ballooning or creation of other domains during the move).

Juergen

--
Juergen Gross
Principal Developer Operating Systems
PBG PDG ES&S SWE OS6
Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions
e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28
Internet: ts.fujitsu.com
D-80807 Muenchen
Company details: ts.fujitsu.com/imprint.html
Dario Faggioli
2013-Apr-09 08:51 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On mar, 2013-04-09 at 09:13 +0100, Juergen Gross wrote:
> What about the following approach:
>
In general, I like it... More details below.

> - do the migration in chunks (like 1GB, may be configurable)
>
Yes, provided these chunks are big enough, I think the overhead is acceptable.

> - don't move pages which are already on one of the target nodes
>
Yep, that is definitely sane, and was already on my TODO list (although, you're right, I forgot to mention it in the cover or in the various changelogs). It's not there yet because I'm missing a way of knowing on what node a page is, but I'm already working on putting it together.

Anyway, I agree on this too, and thanks for pointing that out. :-)

> - try to allocate memory on the target node while the domain is still running.
>   If this fails, there is no need to move that chunk. Depending on the page
>   size requirements (huge pages) decide whether the move is aborted or done
>   partially.
> - in case of successful allocation suspend the domain, do the copy and update
>   page tables for the copied pages, then resume the domain
>
This is also fine, the only issue being that I'd probably need to fiddle with the domain max_mem, and stuff like that, wouldn't I? I'm saying this because, when testing the few patches that I sent already, I ran right into this when I was trying to do it in the allocate-copy-deallocate order (of course, depending on how big a chunk is, but this is going to be much less than 1GB!).

Do you see what I mean? Do you think it would be nice to increase the domain's "memory allowance" (temporarily, of course) for this to be possible?

> - free the memory chunk on the old node(s)
> - repeat until either no memory obtained or move is finished
>
> This will have higher overhead, but the domain will be suspended for only
> short periods of time. The memory requirements don't matter, as the additional
> memory will be allocated only for a short period of time.
>
Yep, this all makes sense, with the only nit being the max_mem issue above.

Thanks again and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Juergen Gross
2013-Apr-09 09:16 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On 09.04.2013 10:51, Dario Faggioli wrote:
> On mar, 2013-04-09 at 09:13 +0100, Juergen Gross wrote:
>> What about the following approach:
>>
> In general, I like it... More details below.
>
>> - do the migration in chunks (like 1GB, may be configurable)
>>
> Yes, provided these chunks are big enough, I think the overhead is
> acceptable.
>
>> - don't move pages which are already on one of the target nodes
>>
> Yep, that is definitely sane, and was already on my TODO list (although,
> you're right, I forgot to mention it in the cover or in the various
> changelogs). It's not there yet because I'm missing a way of knowing on
> what node a page is, but I'm already working on putting it together.
>
> Anyway, I agree on this too, and thanks for pointing that out. :-)
>
>> - try to allocate memory on the target node while the domain is still running.
>>   If this fails, there is no need to move that chunk. Depending on the page
>>   size requirements (huge pages) decide whether the move is aborted or done
>>   partially.
>> - in case of successful allocation suspend the domain, do the copy and update
>>   page tables for the copied pages, then resume the domain
>>
> This is also fine, the only issue being that I'd probably need to fiddle
> with the domain max_mem, and stuff like that, wouldn't I? I'm saying
> this because, when testing the few patches that I sent already, I ran right
> into this when I was trying to do it in the allocate-copy-deallocate order
> (of course, depending on how big a chunk is, but this is going to be
> much less than 1GB!).

There might be 1GB huge pages which have to be copied at once (especially for PV domains). Doing a migration to another node for performance reasons and losing huge-page advantages at the same time seems to be a bad idea. :-)

> Do you see what I mean? Do you think it would be nice to increase the
> domain's "memory allowance" (temporarily, of course) for this to be
> possible?

Would make sense, I think. :-)

>> - free the memory chunk on the old node(s)
>> - repeat until either no memory obtained or move is finished
>>
>> This will have higher overhead, but the domain will be suspended for only
>> short periods of time. The memory requirements don't matter, as the additional
>> memory will be allocated only for a short period of time.
>>
> Yep, this all makes sense, with the only nit being the max_mem issue
> above.

Juergen

--
Juergen Gross
Principal Developer Operating Systems
PBG PDG ES&S SWE OS6
Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions
e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28
Internet: ts.fujitsu.com
D-80807 Muenchen
Company details: ts.fujitsu.com/imprint.html
Dan Magenheimer
2013-Apr-09 17:43 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Subject: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory

(NUMA discussion...)

> > XXX Open issues are:
> >  - TMEM: how to "move" it?

(Konrad added to cc list.)

Tmem memory is, by definition, the lowest priority memory for the domain, and the hypervisor may already be storing it as efficiently as possible (i.e. the page may be deduplicated). When it is accessed by the domain (it is never directly addressable by a domain, and a hypercall is required to access it), an entire page is sequentially copied from a physical page in the hypervisor to the domain. Juergen may know otherwise, but I'd guess this inter-node copy would be very efficiently pipelined, cache-line by cache-line, possibly even with hardware pre-fetching.

So the best answer to "how to move it?" may be "don't move it at all!". In fact, a good design for a NUMA-aware implementation of tmem might intentionally store the data on "any node other than the node making this tmem-put hypercall".

Dan
Dario Faggioli
2013-Apr-11 14:16 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On mar, 2013-04-09 at 18:43 +0100, Dan Magenheimer wrote:
> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> > Subject: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
>
> (NUMA discussion...)
>
Hi Dan,

> > > XXX Open issues are:
> > >  - TMEM: how to "move" it?
>
> (Konrad added to cc list.)
>
> Tmem memory is, by definition, the lowest priority memory
> for the domain and the hypervisor may already be storing it as
> efficiently as possible (i.e. the page may be deduplicated).
> When it is accessed by the domain (it is never directly
> addressable by a domain, and a hypercall is required
> to access it), an entire page is sequentially copied from
> a physical page in the hypervisor to the domain. Juergen may
> know otherwise, but I'd guess this inter-node copy would be
> very efficiently pipelined, cache-line by cache-line,
> possibly even with hardware pre-fetching.
>
Ok, thanks for the clarification.

> So the best answer to "how to move it?" may be "don't
> move it at all!".
>
Ok. I sort of got the feeling that "not touching" would have been TRT but, again, thanks for making it clear. :-)

> In fact, a good design for a NUMA-aware
> implementation of tmem might intentionally store the data on
> "any node other than the node making this tmem-put hypercall".
>
Well, we'll get there too, sooner or later. For now, and for the purpose of this specific work, I'll put things in such a way that they leave TMEM alone.

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Tim Deegan
2013-May-02 14:32 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
Hi,

This looks like a promising start. Two thoughts:

1. You currently move memory into a buffer, free it, allocate new memory and restore the contents. Copying directly from old to new would be significantly faster, and you could do it for _most_ batches:
   - copy old batch 0 to the backup buffer; free old batch 0;
   - allocate new batch 1; copy batch 1 directly; free old batch 1;
   ...
   - allocate new batch n; copy batch n directly; free old batch n;
   - allocate new batch 0; copy batch 0 from the backup buffer.

2. Clearing all the _PAGE_PRESENT bits with mmu-update hypercalls must be overkill. It ought to be possible to drop those pages' typecounts to 0 by unpinning them and then resetting all the vcpus. Then you should be able to just update the contents with normal writes and re-pin afterwards.

Cheers,

Tim.

At 04:49 +0200 on 09 Apr (1365482951), Dario Faggioli wrote:
> as a mechanism of deallocating and reallocating (immediately!) _all_
> the memory of a domain. Notice it relies on the guest being suspended
> already, before the function is invoked.
>
> Of course, it is quite likely that the memory ends up in different
> places from where it was before calling it but, for instance, the fact
> that this is actually a different NUMA node (or anything else) does not
> depend by any means on this function.
>
> In fact, here the guest pages are just freed and immediately
> re-allocated (you can see it as a very quick, back-to-back save-restore
> cycle).
>
> If the current domain configuration says, for instance, that new
> allocations should go to a specific NUMA node, then the whole domain
> is, as a matter of fact, moved there, but again, this is not
> something this function does explicitly.
>
> The way we do this is, very briefly, as follows:
>  1. drop all the references to all the pages of a domain,
>  2. backup the content of a batch of pages,
>  3. deallocate the batch,
>  4. allocate a new set of pages for the batch,
>  5.
copy the backed up content in the new pages,
>  6. if there are more pages, go back to 2, otherwise
>  7. update the page tables, the vcpu contexts, the P2M, etc.
>
> The above raises a number of quite complex issues and _not_all_
> of them are being dealt with or solved in this series (RFC means
> something after all, doesn't it? ;-P).
>
> XXX Open issues are:
>  - HVM ("easy" to add, but it's not in this patch. See the
>    cover letter for the series);
>  - PAE guests, as they need special attention for some of
>    the page tables (should be trivial to add);
>  - grant tables/granted pages: how to move them?
>  - TMEM: how to "move" it?
>  - shared/paged pages: what to do with them?
>  - guest pages mapped in Xen, for instance:
>    * vcpu info pages: moved but, how to update the mapping?
>    * EOI page: moved but, how to update the mapping?
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
> --- a/tools/libxc/Makefile
> +++ b/tools/libxc/Makefile
> @@ -48,6 +48,11 @@ else
>  GUEST_SRCS-y += xc_nomigrate.c
>  endif
>
> +# XXX: Well, for sure there are some X86-isms in the current code.
> +#      Making it more ARM friendly should not be a big deal though,
> +#      will do for next release.
> +GUEST_SRCS-$(CONFIG_X86) += xc_domain_movemem.c
> +
>  vpath %.c ../../xen/common/libelf
>  CFLAGS += -I../../xen/common/libelf
>
> diff --git a/tools/libxc/xc_domain_movemem.c b/tools/libxc/xc_domain_movemem.c
> new file mode 100644
> --- /dev/null
> +++ b/tools/libxc/xc_domain_movemem.c
> @@ -0,0 +1,766 @@
> +/******************************************************************************
> + * xc_domain_movemem.c
> + *
> + * Deallocate and reallocate all the memory of a domain.
> + *
> + * Copyright (c) 2013, Dario Faggioli.
> + * Copyright (c) 2012, Citrix Systems, Inc.
> + * > + * This library is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; > + * version 2.1 of the License. > + * > + * This library is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with this library; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA > + */ > + > +#include <inttypes.h> > +#include <time.h> > +#include <stdlib.h> > +#include <unistd.h> > +#include <sys/time.h> > +#include <xc_core.h> > + > +#include "xc_private.h" > +#include "xc_dom.h" > +#include "xg_private.h" > +#include "xg_save_restore.h" > + > +/* Needed from translation macros in xg_private.h */ > +static struct domain_info_context _dinfo; > +static struct domain_info_context *dinfo = &_dinfo; > + > +#define MAX_BATCH_SIZE 1024 > +#define MAX_PIN_BATCH 1024 > + > +#define MFN_IS_IN_PSEUDOPHYS_MAP(_mfn, _max_mfn, _minfo, _m2p) \ > + (((_mfn) < (_max_mfn)) && ((mfn_to_pfn(_mfn, _m2p) < (_minfo).p2m_size) && \ > + (pfn_to_mfn(mfn_to_pfn(_mfn, _m2p), (_minfo).p2m_table, \ > + (_minfo).guest_width) == (_mfn)))) > + > +/* > + * This is to determine which entries in this page table hold reserved > + * hypervisor mappings. This depends on the current page table type as > + * well as the number of paging levels (see also xc_domain_save.c). > + * > + * XXX: export this function so that it can be used both here and from > + * canonicalize_pagetable(), in xc_domain_save.c. 
> + */ > +static int is_xen_mapping(struct xc_domain_meminfo *minfo, unsigned long type, > + unsigned long hvirt_start, unsigned long m2p_mfn0, > + const void *spage, int pte) > +{ > + int xen_start, xen_end, pte_last; > + > + xen_start = xen_end = pte_last = PAGE_SIZE / 8; > + > + if ( (minfo->pt_levels == 3) && (type == XEN_DOMCTL_PFINFO_L3TAB) ) > + xen_start = L3_PAGETABLE_ENTRIES_PAE; > + > + /* > + * In PAE only the L2 mapping the top 1GB contains Xen mappings. > + * We can spot this by looking for the guest's mapping of the m2p. > + * Guests must ensure that this check will fail for other L2s. > + */ > + if ( (minfo->pt_levels == 3) && (type == XEN_DOMCTL_PFINFO_L2TAB) ) > + { > + int hstart; > + uint64_t he; > + > + hstart = (hvirt_start >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff; > + he = ((const uint64_t *) spage)[hstart]; > + > + if ( ((he >> PAGE_SHIFT) & MFN_MASK_X86) == m2p_mfn0 ) > + { > + /* hvirt starts with xen stuff... */ > + xen_start = hstart; > + } > + else if ( hvirt_start != 0xf5800000 ) > + { > + /* old L2s from before hole was shrunk... */ > + hstart = (0xf5800000 >> L2_PAGETABLE_SHIFT_PAE) & 0x1ff; > + he = ((const uint64_t *) spage)[hstart]; > + if ( ((he >> PAGE_SHIFT) & MFN_MASK_X86) == m2p_mfn0 ) > + xen_start = hstart; > + } > + } > + > + if ( (minfo->pt_levels == 4) && (type == XEN_DOMCTL_PFINFO_L4TAB) ) > + { > + /* > + * XXX SMH: should compute these from hvirt_start (which we have) > + * and hvirt_end (which we don't) > + */ > + xen_start = 256; > + xen_end = 272; > + } > + > + return pte >= xen_start && pte < xen_end; > +} > + > +/* > + * This function will basically deallocate _all_ the memory of a domain and > + * reallocate it immediately. It relies on the guest being suspended > + * already, before the function is even invoked. 
> + * > + * Of course, it is quite likely that the memory ends up in different places > + * from where it was before calling this but, for instance, the fact that > + * this is actually a different NUMA node (or anything else) does not > + * in any way depend on this function. In fact, here the guest pages are > + * just freed and immediately re-allocated (you can see it as a very quick, > + * back-to-back domain_save--domain_restore). If the current domain > + * configuration says, for instance, that new allocations should go to a > + * different NUMA node, then the whole domain is moved there, but again, > + * this is not something this function does explicitly. > + * > + * If actually interested in doing something like that (i.e., moving the > + * domain to a different NUMA node), calling xc_domain_node_setaffinity() > + * right before this should achieve it. > + */ > +int xc_domain_move_memory(xc_interface *xch, uint32_t domid/*, int hvm*/) > +{ > + unsigned int i, j; > + int rc = 1; > + > + xc_dominfo_t info; > + struct xc_domain_meminfo minfo; > + > + struct mmuext_op pin[MAX_PIN_BATCH]; > + unsigned int nr_pins; > + > + struct xc_mmu *mmu = NULL; > + unsigned int xen_pt_levels, dom_guest_width; > + unsigned long max_mfn, hvirt_start, m2p_mfn0; > + vcpu_guest_context_any_t ctxt; > + > + void *live_p2m_frame_list_list = NULL; > + void *live_p2m_frame_list = NULL; > + > + /* > + * XXX: grant tables & granted pages need to be considered, e.g., > + * using xc_is_page_granted_vX() in xc_offline_page.c to > + * recognise them, etc. 
> + int gnt_num; > + grant_entry_v1_t *gnttab_v1 = NULL; > + grant_entry_v2_t *gnttab_v2 = NULL; > + */ > + > + void *old_p, *new_p, *backup = NULL; > + unsigned long mfn, pfn; > + uint64_t fll; > + > + xen_pfn_t *new_mfns = NULL, *old_mfns = NULL, *batch_pfns = NULL; > + int pte_num = PAGE_SIZE / 8, cleared_pte = 0; > + xen_pfn_t *m2p_table, *orig_m2p = NULL; > + shared_info_any_t *live_shinfo = NULL; > + > + unsigned long n = 0, n_skip = 0; > + > + int debug = 0; /* XXX will become a parameter */ > + > + if ( !get_platform_info(xch, domid, &max_mfn, &hvirt_start, > + &xen_pt_levels, &dom_guest_width) ) > + { > + ERROR("Failed getting platform info"); > + return 1; > + } > + > + /* We expect the domain to be suspended already */ > + if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ) > + { > + PERROR("Failed getting domain info"); > + return 1; > + } > + if ( !info.shutdown || info.shutdown_reason != SHUTDOWN_suspend) > + { > + PERROR("Domain appears not to be suspended"); > + return 1; > + } > + > + DBGPRINTF("Establishing the mappings for M2P and P2M"); > + memset(&minfo, 0, sizeof(minfo)); > + if ( !(m2p_table = xc_map_m2p(xch, max_mfn, PROT_READ, &m2p_mfn0)) ) > + { > + PERROR("Failed to map the M2P table"); > + return 1; > + } > + if ( xc_map_domain_meminfo(xch, domid, &minfo) ) > + { > + PERROR("Failed to map domain's memory information"); > + goto out; > + } > + dinfo->guest_width = minfo.guest_width; > + dinfo->p2m_size = minfo.p2m_size; > + > + /* > + * XXX > + DBGPRINTF("Mapping the grant tables"); > + gnttab_v2 = xc_gnttab_map_table_v2(xch, domid, &gnt_num); > + if (!gnttab_v2) > + { > + PERROR("Failed to map V2 grant table... Trying V1"); > + gnttab_v1 = xc_gnttab_map_table_v1(xch, domid, &gnt_num); > + if (!gnttab_v1) > + { > + PERROR("Failed to map grant table"); > + goto out; > + } > + } > + DBGPRINTF("Grant table mapped. 
%d grants found", gnt_num); > + */ > + > + mmu = xc_alloc_mmu_updates(xch, (domid+1)<<16|domid); > + if ( mmu == NULL ) > + { > + PERROR("Failed to allocate memory for MMU updates"); > + goto out; > + } > + > + /* Alloc support data structures */ > + new_mfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t)); > + old_mfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t)); > + batch_pfns = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t)); > + > + backup = malloc(PAGE_SIZE * MAX_BATCH_SIZE); > + > + orig_m2p = calloc(max_mfn, sizeof(xen_pfn_t)); > + > + if ( !new_mfns || !old_mfns || !batch_pfns || !backup || !orig_m2p ) > + { > + ERROR("Failed to allocate copying and/or backup data structures"); > + goto out; > + } > + > + DBGPRINTF("Saving the original M2P"); > + memcpy(orig_m2p, m2p_table, max_mfn * sizeof(xen_pfn_t)); > + > + DBGPRINTF("Starting deallocating and reallocating all memory for domain %d" > "\n\tnr_pages=%lu, nr_shared_pages=%lu, nr_paged_pages=%lu" > "\n\tnr_online_vcpus=%u, max_vcpu_id=%u", > domid, info.nr_pages, info.nr_shared_pages, info.nr_paged_pages, > info.nr_online_vcpus, info.max_vcpu_id); > + > + /* Beware: no going back from this point!! */ > + > + /* > + * As a part of the process of dropping all the references to the existing > + * pages in memory, so that we can free (and then re-allocate) them, we need > + * to unpin them. > + * > + * We do that in batches of 1024 PFNs at each step, to amortize the cost > + * of xc_mmuext_op() calls. 
> + */ > + nr_pins = 0; > + for ( i = 0; i < minfo.p2m_size; i++ ) > + { > + if ( (minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 ) > + continue; > + > + pin[nr_pins].cmd = MMUEXT_UNPIN_TABLE; > + pin[nr_pins].arg1.mfn = minfo.p2m_table[i]; > + nr_pins++; > + > + if ( nr_pins == MAX_PIN_BATCH ) > + { > + if ( xc_mmuext_op(xch, pin, nr_pins, domid) < 0 ) > + { > + PERROR("Failed to unpin a batch of %d MFNs", nr_pins); > + goto out; > + } > + else > + DBGPRINTF("Unpinned a batch of %d MFNs", nr_pins); > + nr_pins = 0; > + } > + } > + if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, domid) < 0) ) > + { > + PERROR("Failed to unpin a batch of %d MFNs", nr_pins); > + goto out; > + } > + else > + DBGPRINTF("Unpinned a batch of %d MFNs", nr_pins); > + > + /* > + * After unpinning, we also need to remove the _PAGE_PRESENT bit from > + * the domain's PTEs for the pages that we want to deallocate, or they > + * simply could never go away. > + */ > + for (i = 0; i < minfo.p2m_size; i++) > + { > + void *content; > + xen_pfn_t table_type, table_mfn = pfn_to_mfn(i, minfo.p2m_table, > + minfo.guest_width); > + > + if ( table_mfn == INVALID_P2M_ENTRY || > + minfo.pfn_type[i] == XEN_DOMCTL_PFINFO_XTAB ) > + { > + DBGPRINTF("Broken P2M entry at PFN 0x%x", i); > + continue; > + } > + > + table_type = minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK; > + if ( table_type < XEN_DOMCTL_PFINFO_L1TAB || > + table_type > XEN_DOMCTL_PFINFO_L4TAB ) > + continue; > + > + content = xc_map_foreign_range(xch, domid, PAGE_SIZE, > + PROT_READ, table_mfn); > + if ( !content ) > + { > + PERROR("Failed to map the table at MFN 0x%lx", table_mfn); > + goto out; > + } > + > + /* Go through each PTE of each table and clear the _PAGE_PRESENT bit */ > + for ( j = 0; j < pte_num; j++ ) > + { > + uint64_t pte = ((uint64_t *)content)[j]; > + > + if ( !pte || is_xen_mapping(&minfo, table_type, hvirt_start, m2p_mfn0, content, j) ) > + continue; > + > + if ( debug ) > + DBGPRINTF("Entry %d: PTE=0x%lx, 
MFN=0x%lx, PFN=0x%lx", j, pte, > + (uint64_t)((pte & MADDR_MASK_X86)>>PAGE_SHIFT), > + m2p_table[(unsigned long)((pte & MADDR_MASK_X86) > + >>PAGE_SHIFT)]); > + > + pfn = m2p_table[(pte & MADDR_MASK_X86)>>PAGE_SHIFT]; > + pte &= ~_PAGE_PRESENT; > + > + if ( xc_add_mmu_update(xch, mmu, table_mfn << PAGE_SHIFT | > + (j * (sizeof(uint64_t))) | > + MMU_PT_UPDATE_PRESERVE_AD, pte) ) > + PERROR("Failed to add some PTE update operation"); > + else > + cleared_pte++; > + } > + > + if (content) > + munmap(content, PAGE_SIZE); > + } > + if ( cleared_pte && xc_flush_mmu_updates(xch, mmu) ) > + { > + PERROR("Failed flushing some PTE update operations"); > + goto out; > + } > + else > + DBGPRINTF("Cleared presence for %d PTEs", cleared_pte); > + > + /* Scan all the P2M ... */ > + while ( n < minfo.p2m_size ) > + { > + /* ... But all operations are done in batches */ > + for ( i = 0; (i < MAX_BATCH_SIZE) && (n < minfo.p2m_size); n++ ) > + { > + xen_pfn_t mfn = pfn_to_mfn(n, minfo.p2m_table, minfo.guest_width); > + xen_pfn_t mfn_type = minfo.pfn_type[n] & XEN_DOMCTL_PFINFO_LTAB_MASK; > + > + if (mfn == INVALID_P2M_ENTRY || !is_mapped(mfn) ) > + { > + if ( debug ) > + DBGPRINTF("Skipping invalid or unmapped MFN 0x%lx", mfn); > + n_skip++; > + continue; > + } > + if ( mfn_type == XEN_DOMCTL_PFINFO_BROKEN || > + mfn_type == XEN_DOMCTL_PFINFO_XTAB || > + mfn_type == XEN_DOMCTL_PFINFO_XALLOC ) > + { > + if ( debug ) > + DBGPRINTF("Skipping broken or alloc-only MFN 0x%lx", mfn); > + n_skip++; > + continue; > + } > + > + /* > + if ( gnttab_v1 ? > + xc_is_page_granted_v1(xch, mfn, gnttab_v1, gnt_num) : > + xc_is_page_granted_v2(xch, mfn, gnttab_v2, gnt_num) ) > + { > + n_skip++; > + continue; > + } > + */ > + > + old_mfns[i] = mfn; > + batch_pfns[i] = n; > + i++; > + } > + > + /* Was the batch empty? */ > + if ( i == 0) > + continue; > + > + /* > + * And now the core of the whole thing: map the PFNs in the batch, > + * back them up, allocate new pages for them, and copy them there. 
> + * We do this in this order, and we pass through a local backup, > + * because we don't want to risk hitting the max_mem limit for > + * the domain (which would be possible, depending on MAX_BATCH_SIZE, > + * if we try to do it like allocate->copy->deallocate). > + * > + * With MAX_BATCH_SIZE of 1024 and 4K pages, this means we are moving > + * 4MB of guest memory for each batch. > + */ > + > + /* Map and backup */ > + old_p = xc_map_foreign_pages(xch, domid, PROT_READ, old_mfns, i); > + if ( !old_p ) > + { > + PERROR("Failed mapping the current MFNs\n"); > + goto out; > + } > + memcpy(backup, old_p, PAGE_SIZE * i); > + munmap(old_p, PAGE_SIZE * i); > + > + /* Deallocation and re-allocation */ > + if ( xc_domain_decrease_reservation(xch, domid, i, 0, old_mfns) != i || > + xc_domain_populate_physmap_exact(xch, domid, i, 0, 0, new_mfns) ) > + { > + PERROR("Failed making space or allocating the new MFNs\n"); > + munmap(backup, PAGE_SIZE * i); > + goto out; > + } > + > + /* Map the new pages, copy the content and unmap */ > + new_p = xc_map_foreign_pages(xch, domid, PROT_WRITE, new_mfns, i); > + if ( !new_p ) > + { > + PERROR("Failed mapping the new MFNs\n"); > + munmap(backup, PAGE_SIZE * i); > + goto out; > + } > + memcpy(new_p, backup, PAGE_SIZE * i); > + munmap(new_p, PAGE_SIZE * i); > + munmap(backup, PAGE_SIZE * i); > + > + /* > + * Since we already have the new MFNs, we can update both the M2P > + * and the P2M right here, within this same loop. 
> + */ > + for ( j = 0; j < i; j++ ) > + { > + minfo.p2m_table[batch_pfns[j]] = new_mfns[j]; > + if ( xc_add_mmu_update(xch, mmu, > + (((uint64_t)new_mfns[j]) << PAGE_SHIFT) | > + MMU_MACHPHYS_UPDATE, batch_pfns[j]) ) > + { > + PERROR("Failed updating M2P\n"); > + goto out; > + } > + } > + if ( xc_flush_mmu_updates(xch, mmu) ) > + { > + PERROR("Failed updating M2P\n"); > + goto out; > + } > + > + DBGPRINTF("Batch %lu/%ld done (%lu pages skipped)", > + n / MAX_BATCH_SIZE, minfo.p2m_size / MAX_BATCH_SIZE, n_skip); > + } > + > + /* > + * Finally (oh, well...) update the PTEs of the domain again, putting > + * the new MFNs there, and making the entries _PAGE_PRESENT again. > + * > + * This is a kind of uncanonicalization, like the one that happens in > + * save-restore, although a very special one, and we rely on the snapshot > + * of the M2P we made before starting all the deallocation/reallocation > + * process. > + */ > + for ( i = 0; i < minfo.p2m_size; i++ ) > + { > + void *content; > + xen_pfn_t table_type, table_mfn = pfn_to_mfn(i, minfo.p2m_table, > + minfo.guest_width); > + > + table_type = minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK; > + if ( table_type < XEN_DOMCTL_PFINFO_L1TAB || > + table_type > XEN_DOMCTL_PFINFO_L4TAB ) > + continue; > + > + /* We of course only care about tables */ > + content = xc_map_foreign_range(xch, domid, PAGE_SIZE, > + PROT_WRITE, table_mfn); > + if ( !content ) > + { > + PERROR("Failed to map the table at MFN 0x%lx", table_mfn); > + continue; > + } > + > + for ( j = 0; j < PAGE_SIZE / 8; j++ ) > + { > + uint64_t pte = ((uint64_t *)content)[j]; > + > + if ( !pte || is_xen_mapping(&minfo, table_type, hvirt_start, m2p_mfn0, content, j) ) > + continue; > + > + /* > + * Basically, we look up the PFN from the snapshotted M2P and we > + * pick up the new MFN from the P2M (since we updated it "live" > + * during the re-allocation phase above). 
> + */ > + mfn = (pte >> PAGE_SHIFT) & MFN_MASK_X86; > + pfn = orig_m2p[mfn]; > + > + if ( debug ) > + DBGPRINTF("Table[PTE]: 0x%lx[%d] ==> orig_m2p[0x%lx]=0x%lx, " > + "p2m[0x%lx]=0x%lx // pte: 0x%lx --> 0x%lx", > + table_mfn, j, mfn, pfn, pfn, minfo.p2m_table[pfn], > + pte, (uint64_t)((pte & ~MADDR_MASK_X86)| > + (minfo.p2m_table[pfn]<<PAGE_SHIFT)| > + _PAGE_PRESENT)); > + > + mfn = minfo.p2m_table[pfn]; > + pte &= ~MADDR_MASK_X86; > + pte |= (uint64_t)mfn << PAGE_SHIFT; > + pte |= _PAGE_PRESENT; > + > + ((uint64_t *)content)[j] = pte; > + > + if ( !MFN_IS_IN_PSEUDOPHYS_MAP(mfn, max_mfn, minfo, m2p_table) ) > + { > + ERROR("Failed updating entry %d in table at MFN 0x%lx", j, table_mfn); > + continue; // XXX > + } > + } > + > + if ( content ) > + munmap(content, PAGE_SIZE); > + } > + > + DBGPRINTF("Re-pinning page table MFNs"); > + > + /* Pin the table types again */ > + nr_pins = 0; > + for ( i = 0; i < minfo.p2m_size; i++ ) > + { > + if ( (minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 ) > + continue; > + > + switch ( minfo.pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) > + { > + case XEN_DOMCTL_PFINFO_L1TAB: > + pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE; > + break; > + > + case XEN_DOMCTL_PFINFO_L2TAB: > + pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE; > + break; > + > + case XEN_DOMCTL_PFINFO_L3TAB: > + pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE; > + break; > + > + case XEN_DOMCTL_PFINFO_L4TAB: > + pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE; > + break; > + default: > + continue; > + } > + pin[nr_pins].arg1.mfn = minfo.p2m_table[i]; > + nr_pins++; > + > + if ( nr_pins == MAX_PIN_BATCH ) > + { > + if ( xc_mmuext_op(xch, pin, nr_pins, domid) < 0 ) > + { > + PERROR("Failed to pin a batch of %d MFNs", nr_pins); > + goto out; > + } > + else > + DBGPRINTF("Re-pinned a batch of %d MFNs", nr_pins); > + nr_pins = 0; > + } > + } > + if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, domid) < 0) ) > + { > + PERROR("Failed to pin batch of %d page tables", nr_pins); > + goto out; > 
+ } > + else > + DBGPRINTF("Re-pinned a batch of %d MFNs", nr_pins); > + > + /* > + * Now, take care of the vCPUs' contexts. It all happens as above: > + * we use the original M2P and the new domain's P2M to update all > + * the various references. > + */ > + for ( i = 0; i <= info.max_vcpu_id; i++ ) > + { > + xc_vcpuinfo_t vinfo; > + > + DBGPRINTF("Adjusting context for VCPU%d", i); > + > + if ( xc_vcpu_getinfo(xch, domid, i, &vinfo) ) > + { > + PERROR("Failed getting info for VCPU%d", i); > + goto out; > + } > + if ( !vinfo.online ) > + { > + DBGPRINTF("VCPU%d seems offline", i); > + continue; > + } > + > + if ( xc_vcpu_getcontext(xch, domid, i, &ctxt) ) > + { > + PERROR("No context for VCPU%d", i); > + goto out; > + } > + > + if ( i == 0 ) > + { > + //start_info_any_t *start_info; > + > + /* > + * Update the start info frame number. It is the 3rd argument > + * to the HYPERVISOR_sched_op hypercall when op is > + * SCHEDOP_shutdown and reason is SHUTDOWN_suspend, so we find > + * it in EDX. > + */ > + mfn = GET_FIELD(&ctxt, user_regs.edx); > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + SET_FIELD(&ctxt, user_regs.edx, mfn); > + > + /* > + * XXX: I checked, and store_mfn and console_mfn seemed ok, at > + * least from a 'mapping' point of view, but more testing is > + * needed. 
> + start_info = xc_map_foreign_range(xch, domid, PAGE_SIZE, PROT_READ | PROT_WRITE, mfn); > + munmap(start_info, PAGE_SIZE); > + */ > + } > + > + /* GDT pointing MFNs */ > + for ( j = 0; (512*j) < GET_FIELD(&ctxt, gdt_ents); j++ ) > + { > + mfn = GET_FIELD(&ctxt, gdt_frames[j]); > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + SET_FIELD(&ctxt, gdt_frames[j], mfn); > + } > + > + /* CR3 XXX: PAE needs special attention here, I think */ > + mfn = UNFOLD_CR3(GET_FIELD(&ctxt, ctrlreg[3])); > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + SET_FIELD(&ctxt, ctrlreg[3], FOLD_CR3(mfn)); > + > + /* Guest pagetable (x86/64) in CR1 */ > + if ( (minfo.pt_levels == 4) && ctxt.x64.ctrlreg[1] ) > + { > + /* > + * XXX: save-restore code mangles the least-significant > + * bit ('valid PFN'). This should not be needed here. > + */ > + mfn = UNFOLD_CR3(ctxt.x64.ctrlreg[1]); > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + ctxt.x64.ctrlreg[1] = FOLD_CR3(mfn); > + } > + > + /* > + * XXX: Xen refuses to set a new context for an existing vCPU if > + * things like CR3 or the GDTs have changed, even if the domain > + * is suspended. Going through re-initializing the vCPU (by > + * this one call below with a NULL ctxt) makes it possible, > + * but is that sensible? And even if so, is the _setcontext > + * call issued below enough? > + */ > + if ( xc_vcpu_setcontext(xch, domid, i, NULL) ) > + { > + PERROR("Failed re-initialising VCPU%d", i); > + goto out; > + } > + if ( xc_vcpu_setcontext(xch, domid, i, &ctxt) ) > + { > + PERROR("Failed when updating context for VCPU%d", i); > + goto out; > + } > + } > + > + /* > + * Finally (and this time for real), we take care of the pages mapping > + * the P2M, and of the P2M entries themselves. 
> + */ > + > + live_shinfo = xc_map_foreign_range(xch, domid, > + PAGE_SIZE, PROT_READ|PROT_WRITE, info.shared_info_frame); > + if ( !live_shinfo ) > + { > + PERROR("Failed mapping live_shinfo"); > + goto out; > + } > + > + fll = GET_FIELD(live_shinfo, arch.pfn_to_mfn_frame_list_list); > + fll = minfo.p2m_table[mfn_to_pfn(fll, orig_m2p)]; > + live_p2m_frame_list_list = xc_map_foreign_range(xch, domid, PAGE_SIZE, > + PROT_READ|PROT_WRITE, fll); > + if ( !live_p2m_frame_list_list ) > + { > + PERROR("Couldn't map live_p2m_frame_list_list"); > + goto out; > + } > + SET_FIELD(live_shinfo, arch.pfn_to_mfn_frame_list_list, fll); > + > + /* First, update the frames containing the list of the P2M frames */ > + for ( i = 0; i < P2M_FLL_ENTRIES; i++ ) > + { > + > + mfn = ((uint64_t *)live_p2m_frame_list_list)[i]; > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + ((uint64_t *)live_p2m_frame_list_list)[i] = mfn; > + } > + > + live_p2m_frame_list = > + xc_map_foreign_pages(xch, domid, PROT_READ|PROT_WRITE, > + live_p2m_frame_list_list, > + P2M_FLL_ENTRIES); > + if ( !live_p2m_frame_list ) > + { > + PERROR("Couldn't map live_p2m_frame_list"); > + goto out; > + } > + > + /* And then update the actual entries of it */ > + for ( i = 0; i < P2M_FL_ENTRIES; i++ ) > + { > + mfn = ((uint64_t *)live_p2m_frame_list)[i]; > + mfn = minfo.p2m_table[mfn_to_pfn(mfn, orig_m2p)]; > + ((uint64_t *)live_p2m_frame_list)[i] = mfn; > + } > + > + rc = 0; > + > + out: > + if ( live_p2m_frame_list_list ) > + munmap(live_p2m_frame_list_list, PAGE_SIZE); > + if ( live_p2m_frame_list ) > + munmap(live_p2m_frame_list, P2M_FLL_ENTRIES * PAGE_SIZE); > + if ( live_shinfo ) > + munmap(live_shinfo, PAGE_SIZE); > + > + free(mmu); > + free(new_mfns); > + free(old_mfns); > + free(batch_pfns); > + free(backup); > + free(orig_m2p); > + > + /* > + if (gnttab_v1) > + munmap(gnttab_v1, gnt_num / (PAGE_SIZE/sizeof(grant_entry_v1_t))); > + if (gnttab_v2) > + munmap(gnttab_v2, gnt_num / 
(PAGE_SIZE/sizeof(grant_entry_v2_t))); > + */ > + > + xc_unmap_domain_meminfo(xch, &minfo); > + munmap(m2p_table, M2P_SIZE(max_mfn)); > + > + return !!rc; > +} > diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h > --- a/tools/libxc/xenguest.h > +++ b/tools/libxc/xenguest.h > @@ -272,6 +272,15 @@ int xc_query_page_offline_status(xc_inte > > int xc_exchange_page(xc_interface *xch, int domid, xen_pfn_t mfn); > > +/** > + * This function deallocates all the guest's memory and immediately > + * allocates it again, with the net effect of moving it somewhere > + * else with respect to where it was when the function was invoked. > + * > + * @param xch a handle to an open hypervisor interface. > + * @param domid the domain id one wants to move the memory of. > + */ > +int xc_domain_move_memory(xc_interface *xch, uint32_t domid/*, int hvm*/); > > /** > * Memory related information, such as PFN types, the P2M table, > diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h > --- a/tools/libxc/xg_private.h > +++ b/tools/libxc/xg_private.h > @@ -145,6 +145,11 @@ static inline xen_pfn_t pfn_to_mfn(xen_p > (((uint32_t *)p2m)[(pfn)])))); > } > > +static inline xen_pfn_t mfn_to_pfn(xen_pfn_t mfn, xen_pfn_t *m2p) > +{ > + return m2p[mfn]; > +} > + > /* Number of xen_pfn_t in a page */ > #define FPP (PAGE_SIZE/(dinfo->guest_width)) > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
George Dunlap
2013-May-02 15:07 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On 02/05/13 15:32, Tim Deegan wrote: > Hi, > > This looks like a promising start. Two thoughts: > > 1. You currently move memory into a buffer, free it, allocate new memory > and restore the contents. Copying directly from old to new would be > significantly faster, and you could do it for _most_ batches: > - copy old batch 0 to the backup buffer; free old batch 0; > - allocate new batch 1; copy batch 1 directly; free old batch 1; > ... > - allocate new batch n; copy batch n directly; free old batch n; > - allocate new batch 0; copy batch 0 from the backup buffer. Hmm -- isn't it the case that if there is not *free* memory lying around somewhere, then this operation is fairly pointless? What will happen is that after freeing batch 0, "allocate new batch 1" will get that memory. So copying it to a temporary buffer in dom0 seems like not a particularly useful thing to do -- it should try to allocate a new buffer to copy into directly, and if that fails, just say "No point trying -- no empty memory to move into." Unless of course we were trying to do this to two (or more) VMs at the same time, but that seems like the next level. -George
Tim Deegan
2013-May-02 15:13 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
At 16:07 +0100 on 02 May (1367510834), George Dunlap wrote: > On 02/05/13 15:32, Tim Deegan wrote: > >Hi, > > > >This looks like a promising start. Two thoughts: > > > >1. You currently move memory into a buffer, free it, allocate new memory > > and restore the contents. Copying directly from old to new would be > > significantly faster, and you could do it for _most_ batches: > > - copy old batch 0 to the backup buffer; free old batch 0; > > - allocate new batch 1; copy batch 1 directly; free old batch 1; > > ... > > - allocate new batch n; copy batch n directly; free old batch n; > > - allocate new batch 0; copy batch 0 from the backup buffer. > > Hmm -- isn't it the case that if there is not *free* memory lying around > somewhere, then this operation is fairly pointless? What will happen is > that after freeing batch 0, "allocate new batch 1" will get that > memory. So copying it to a temporary buffer in dom0 seems like not a > particularly useful thing to do -- it should try to allocate a new > buffer to copy into directly, and if that fails, just say "No point > trying -- no empty memory to move into." Sure, that's better, as long as the temporary bump in the VM's max_pages is acceptable to the rest of the toolstack. :) Tim.
Dario Faggioli
2013-May-06 17:29 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On gio, 2013-05-02 at 15:32 +0100, Tim Deegan wrote: > Hi, > Hi Tim, Thanks for looking at this! :-) > This looks like a promising start. Two thoughts: > > 1. You currently move memory into a buffer, free it, allocate new memory > and restore the contents. Copying directly from old to new would be > significantly faster, and you could do it for _most_ batches: > - copy old batch 0 to the backup buffer; free old batch 0; > - allocate new batch 1; copy batch 1 directly; free old batch 1; > ... > - allocate new batch n; copy batch n directly; free old batch n; > - allocate new batch 0; copy batch 0 from the backup buffer. > I see what you mean, and I think it's feasible. One thing I noticed (and not yet tracked down properly, actually) is some sort of "latency" in freeing the pages... I'll investigate that better and go for what you suggest if possible. > 2. Clearing all the _PAGE_PRESENT bits with mmu-update > hypercalls must be overkill. It ought to be possible to drop > those pages' typecounts to 0 by unpinning them and then resetting all > the vcpus. Then you should be able to just update the contents > with normal writes and re-pin afterwards. > Yeah, I thought the same, but haven't found a sensible way of making that happen yet. However, the 'reset all vcpus' thing definitely needs more attention (and I'm investigating it right now). I'll keep digging and let you know what I find. Thanks again, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2013-May-06 17:37 UTC
Re: [PATCH 6 of 8 [RFC]] libxc: introduce xc_domain_move_memory
On gio, 2013-05-02 at 16:13 +0100, Tim Deegan wrote: > At 16:07 +0100 on 02 May (1367510834), George Dunlap wrote: > > > Hmm -- isn't it the case that if there is not *free* memory lying around > > somewhere, then this operation is fairly pointless? What will happen is > > that after freeing batch 0, "allocate new batch 1" will get that > > memory. So copying it to a temporary buffer in dom0 seems like not a > > particularly useful thing to do -- it should try to allocate a new > > buffer to copy into directly, and if that fails, just say "No point > > trying -- no empty memory to move into." > George, good point, checking for free memory is something I did not think about, but it's necessary for this whole thing to be meaningful. This could be tricky to do in the right way, due to the well-known races we have when dealing with memory at the toolstack level, but I'll give it a thought, thanks. :-) However... > Sure, that's better, as long as the temporary bump in the VM's max_pages > is acceptable to the rest of the toolstack. :) > ... This point of Tim's is the main reason I'm going through a temporary buffer in Dom0: I can't be sure that, if allocating more memory for the domain before freeing fails, it is because the host is actually out of memory or because I'm hitting max_pages. That's why I went for the "deallocate first" approach. I can investigate what temporarily bumping the page limit could mean, but I think I like what Tim proposed in his first e-mail better... Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel