thr3ads.net - Xen devel - [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service [Apr 2013]

If this information is useful, please help other people find it:
Share via:

Wen Congyang

2013-Apr-03 08:02 UTC

[RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service

Virtual machine (VM) replication is a well known technique for providing
application-agnostic software-implemented hardware fault tolerance -
"non-stop service". Currently, remus provides this function, but it
buffers
all output packets, and the latency is unacceptable.

In xen summit 2012, We introduce a new VM replication solution: colo
(COarse-grain LOck-stepping virtual machine). The presentation is in
the following URL:
http://www.slideshare.net/xen_com_mgr/colo-coarsegrain-lockstepping-virtual-machines-for-nonstop-service

Here is the summary of the solution:
From the client''s point of view, as long as the client observes
identical
responses from the primary and secondary VMs, according to the service
semantics, then the secondary VM(SVM) is a valid replica of the primary
VM(PVM), and can successfully take over when a hardware failure of the
PVM is detected.

This patchset is RFC, and implements the frame of colo:
1. Both PVM and SVM are running
2. Forward the input packets from client to secondary machine(slaver)
3. Forward the output packets from SVM to primary machine(master)
4. Compare the output packets from PVM and SVM on the master side. If the
   output packets are different, do a checkpoint

Changelog:
  Patch 1: optimize the dirty pages transfer speed.
  Patch 2-3: allow SVM running after checkpoint
  Patch 4-5: modification for colo on the master side(wait a new checkpoint,
             communicate with slaver when doing checkoint)
  Patch 6-7: implement colo''s user interface

Wen Congyang (7):
  xc_domain_save: cache pages mapping
  xc_domain_restore: introduce restore_callbacks for colo
  colo: implement restore_callbacks
  xc_domain_save: flush cache before calling callbacks->postcopy()
  xc_domain_save: implement save_callbacks for colo
  XendCheckpoint: implement colo
  remus: implement colo mode

 tools/libxc/Makefile                              |   4 +-
 tools/libxc/ia64/xc_ia64_linux_restore.c          |   3 +-
 tools/libxc/xc_domain_restore.c                   | 256 +++++---
 tools/libxc/xc_domain_restore_colo.c              | 740 ++++++++++++++++++++++
 tools/libxc/xc_domain_save.c                      | 162 +++--
 tools/libxc/xc_save_restore_colo.h                |  44 ++
 tools/libxc/xenguest.h                            |  57 +-
 tools/libxl/libxl_dom.c                           |   2 +-
 tools/python/xen/lowlevel/checkpoint/checkpoint.c | 289 ++++++++-
 tools/python/xen/lowlevel/checkpoint/checkpoint.h |   2 +
 tools/python/xen/remus/image.py                   |   7 +-
 tools/python/xen/remus/save.py                    |   6 +-
 tools/python/xen/xend/XendCheckpoint.py           | 138 ++--
 tools/remus/remus                                 |   8 +-
 tools/xcutils/xc_restore.c                        |   3 +-
 xen/include/public/xen.h                          |   1 +
 16 files changed, 1503 insertions(+), 219 deletions(-)
 create mode 100644 tools/libxc/xc_domain_restore_colo.c
 create mode 100644 tools/libxc/xc_save_restore_colo.h

-- 
1.8.0

Wen Congyang

2013-Apr-03 08:02 UTC

head link

[RFC PATCH 1/7] xc_domain_save: cache pages mapping

We map the dirty pages, and copy them to secondary machine, and then unmap
it. xc_map_foreign_bulk() is too slow, so we can''t use full bandwidth
to
transfer dirty pages. In out test, the transfer speed is less than 300Mb/s.
For virtual machine (VM) replication, the transfer speed is very important,
so we should cache pages mapping to map the same page only one time. In
our test, the transfer speed is about 2Gb/s with this patch.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

---
 tools/libxc/xc_domain_save.c | 113 +++++++++++++++++++++++++------------------
 1 file changed, 66 insertions(+), 47 deletions(-)

diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index fa270f5..222aa03 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -896,6 +896,50 @@ static int save_tsc_info(xc_interface *xch, uint32_t dom,
int io_fd)
     return 0;
 }
 
+/* big cache to avoid future map */
+static char **pages_base;
+
+static int colo_ro_map_and_cache(xc_interface *xch, uint32_t dom,
+                                 unsigned long *pfn_batch, xen_pfn_t *pfn_type,
+                                 int *pfn_err, int batch)
+{
+    static xen_pfn_t cache_pfn_type[MAX_BATCH_SIZE];
+    static int cache_pfn_err[MAX_BATCH_SIZE];
+    int i, cache_batch = 0;
+    char *map;
+
+    for (i = 0; i < batch; i++)
+    {
+        if (!pages_base[pfn_batch[i]])
+            cache_pfn_type[cache_batch++] = pfn_type[i];
+    }
+
+    if (cache_batch)
+    {
+        map = xc_map_foreign_bulk(xch, dom, PROT_READ, cache_pfn_type,
cache_pfn_err, cache_batch);
+        if (!map)
+            return -1;
+    }
+
+    cache_batch = 0;
+    for (i = 0; i < batch; i++)
+    {
+        if (pages_base[pfn_batch[i]])
+        {
+            pfn_err[i] = 0;
+        }
+        else
+        {
+            if (!cache_pfn_err[cache_batch])
+                pages_base[pfn_batch[i]] = map + PAGE_SIZE * cache_batch;
+            pfn_err[i] = cache_pfn_err[cache_batch];
+            cache_batch++;
+        }
+    }
+
+    return 0;
+}
+
 int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t
max_iters,
                    uint32_t max_factor, uint32_t flags,
                    struct save_callbacks* callbacks, int hvm)
@@ -927,9 +971,6 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t
dom, uint32_t max_iter
     /* Live mapping of shared info structure */
     shared_info_any_t *live_shinfo = NULL;
 
-    /* base of the region in which domain memory is mapped */
-    unsigned char *region_base = NULL;
-
     /* A copy of the CPU eXtended States of the guest. */
     DECLARE_HYPERCALL_BUFFER(void, buffer);
 
@@ -1111,6 +1152,14 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t
dom, uint32_t max_iter
     memset(pfn_type, 0,
            ROUNDUP(MAX_BATCH_SIZE * sizeof(*pfn_type), PAGE_SHIFT));
 
+    pages_base = calloc(dinfo->p2m_size, sizeof(*pages_base));
+    if (!pages_base)
+    {
+        ERROR("failed to alloc memory to cache page mapping");
+        errno = ENOMEM;
+        goto out;
+    }
+
     /* Setup the mfn_to_pfn table mapping */
     if ( !(ctx->live_m2p = xc_map_m2p(xch, ctx->max_mfn, PROT_READ,
&ctx->m2p_mfn0)) )
     {
@@ -1308,9 +1357,8 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t
dom, uint32_t max_iter
             if ( batch == 0 )
                 goto skip; /* vanishingly unlikely... */
 
-            region_base = xc_map_foreign_bulk(
-                xch, dom, PROT_READ, pfn_type, pfn_err, batch);
-            if ( region_base == NULL )
+            if (colo_ro_map_and_cache(xch, dom, pfn_batch, pfn_type, pfn_err,
+                                      batch) < 0)
             {
                 PERROR("map batch failed");
                 goto out;
@@ -1356,7 +1404,7 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t
dom, uint32_t max_iter
                         DPRINTF("%d pfn=%08lx sum=%08lx\n",
                                 iter,
                                 pfn_type[j],
-                                csum_page(region_base + (PAGE_SIZE*j)));
+                                csum_page(pages_base[pfn_batch[j]]));
                     else
                         DPRINTF("%d pfn= %08lx mfn= %08lx [mfn]=
%08lx"
                                 " sum= %08lx\n",
@@ -1364,13 +1412,12 @@ int xc_domain_save(xc_interface *xch, int io_fd,
uint32_t dom, uint32_t max_iter
                                 pfn_type[j],
                                 gmfn,
                                 mfn_to_pfn(gmfn),
-                                csum_page(region_base + (PAGE_SIZE*j)));
+                                csum_page(pages_base[pfn_batch[j]]));
                 }
             }
 
             if ( !run )
             {
-                munmap(region_base, batch*PAGE_SIZE);
                 continue; /* bail on this batch: no valid pages */
             }
 
@@ -1393,33 +1440,14 @@ int xc_domain_save(xc_interface *xch, int io_fd,
uint32_t dom, uint32_t max_iter
                     pfn_type[j] = ((unsigned long *)pfn_type)[j];
 
             /* entering this loop, pfn_type is now in pfns (Not mfns) */
-            run = 0;
             for ( j = 0; j < batch; j++ )
             {
                 unsigned long pfn, pagetype;
-                void *spage = (char *)region_base + (PAGE_SIZE*j);
+                void *spage = pages_base[pfn_batch[j]];
 
                 pfn      = pfn_type[j] & ~XEN_DOMCTL_PFINFO_LTAB_MASK;
                 pagetype = pfn_type[j] &  XEN_DOMCTL_PFINFO_LTAB_MASK;
 
-                if ( pagetype != 0 )
-                {
-                    /* If the page is not a normal data page, write out any
-                       run of pages we may have previously acumulated */
-                    if ( run )
-                    {
-                        if ( ratewrite(io_fd, live, 
-                                       (char*)region_base+(PAGE_SIZE*(j-run)), 
-                                       PAGE_SIZE*run) != PAGE_SIZE*run )
-                        {
-                            PERROR("Error when writing to state file
(4a)"
-                                  " (errno %d)", errno);
-                            goto out;
-                        }                        
-                        run = 0;
-                    }
-                }
-
                 /* skip pages that aren''t present */
                 if ( pagetype == XEN_DOMCTL_PFINFO_XTAB )
                     continue;
@@ -1449,28 +1477,19 @@ int xc_domain_save(xc_interface *xch, int io_fd,
uint32_t dom, uint32_t max_iter
                 }
                 else
                 {
-                    /* We have a normal page: accumulate it for writing. */
-                    run++;
+                    /* stop accumulate write temporarily. we will add it
+                     * back via writev() when needed.
+                     */
+                    if (ratewrite(io_fd, live, spage, PAGE_SIZE) != PAGE_SIZE)
+                    {
+                        PERROR("Error when writing to state file
(4c)"
+                               " (errno %d)", errno);
+                        goto out;
+                    }
                 }
             } /* end of the write out for this batch */
 
-            if ( run )
-            {
-                /* write out the last accumulated run of pages */
-                if ( ratewrite(io_fd, live, 
-                               (char*)region_base+(PAGE_SIZE*(j-run)), 
-                               PAGE_SIZE*run) != PAGE_SIZE*run )
-                {
-                    PERROR("Error when writing to state file (4c)"
-                          " (errno %d)", errno);
-                    goto out;
-                }                        
-            }
-
             sent_this_iter += batch;
-
-            munmap(region_base, batch*PAGE_SIZE);
-
         } /* end of this while loop for this iteration */
 
       skip:
-- 
1.8.0

Wen Congyang

2013-Apr-03 08:02 UTC

head link

[RFC PATCH 2/7] xc_domain_restore: introduce restore_callbacks for colo

In colo mode, SVM also runs. So we should update xc_restore to support it.
The first step is: add some callbacks for colo.

We add the following callbacks:
1. init(): init the private data used for colo
2. free(): free the resource we allocate and store in the private data
3. get_page(): SVM runs, so we can''t update the memory in
apply_batch().
   This callback will return a page buffer, and apply_batch() will copy
   the page to this buffer. The content of this buffer should be the current
   content of this page, so we can use it to do verify.
4. flush_memory(): update the SVM memory and pagetable.
5. update_p2m(): update the SVM p2m page.
6. finish_restore(): wait a new checkpoint.

We also add a new structure restore_data to avoid pass too many arguments
to these callbacks. This structure stores the variables used in
xc_domain_store(), and these variables will be used in the callback.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

---
 tools/libxc/ia64/xc_ia64_linux_restore.c |   3 +-
 tools/libxc/xc_domain_restore.c          | 256 +++++++++++++++++++++----------
 tools/libxc/xenguest.h                   |  54 ++++++-
 tools/libxl/libxl_dom.c                  |   2 +-
 tools/xcutils/xc_restore.c               |   3 +-
 5 files changed, 230 insertions(+), 88 deletions(-)

diff --git a/tools/libxc/ia64/xc_ia64_linux_restore.c
b/tools/libxc/ia64/xc_ia64_linux_restore.c
index b4e9e9c..ca76be6 100644
--- a/tools/libxc/ia64/xc_ia64_linux_restore.c
+++ b/tools/libxc/ia64/xc_ia64_linux_restore.c
@@ -550,7 +550,8 @@ int
 xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                   unsigned int store_evtchn, unsigned long *store_mfn,
                   unsigned int console_evtchn, unsigned long *console_mfn,
-                  unsigned int hvm, unsigned int pae, int superpages)
+                  unsigned int hvm, unsigned int pae, int superpages,
+                  struct restore_callbacks *callbacks)
 {
     DECLARE_DOMCTL;
     int rc = 1;
diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index 43e6c52..fa828e9 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -882,13 +882,15 @@ static int pagebuf_get(xc_interface *xch, struct
restore_ctx *ctx,
 static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx
*ctx,
                        xen_pfn_t* region_mfn, unsigned long* pfn_type, int
pae_extended_cr3,
                        unsigned int hvm, struct xc_mmu* mmu,
-                       pagebuf_t* pagebuf, int curbatch)
+                       pagebuf_t* pagebuf, int curbatch,
+                       struct restore_callbacks *callbacks)
 {
     int i, j, curpage, nr_mfns;
     /* used by debug verify code */
     unsigned long buf[PAGE_SIZE/sizeof(unsigned long)];
     /* Our mapping of the current region (batch) */
     char *region_base;
+    char *target_buf;
     /* A temporary mapping, and a copy, of one frame of guest memory. */
     unsigned long *page = NULL;
     int nraces = 0;
@@ -954,16 +956,19 @@ static int apply_batch(xc_interface *xch, uint32_t dom,
struct restore_ctx *ctx,
         }
     }
 
-    /* Map relevant mfns */
-    pfn_err = calloc(j, sizeof(*pfn_err));
-    region_base = xc_map_foreign_bulk(
-        xch, dom, PROT_WRITE, region_mfn, pfn_err, j);
-
-    if ( region_base == NULL )
+    if ( !callbacks || !callbacks->get_page)
     {
-        PERROR("map batch failed");
-        free(pfn_err);
-        return -1;
+        /* Map relevant mfns */
+        pfn_err = calloc(j, sizeof(*pfn_err));
+        region_base = xc_map_foreign_bulk(
+            xch, dom, PROT_WRITE, region_mfn, pfn_err, j);
+
+        if ( region_base == NULL )
+        {
+            PERROR("map batch failed");
+            free(pfn_err);
+            return -1;
+        }
     }
 
     for ( i = 0, curpage = -1; i < j; i++ )
@@ -975,7 +980,7 @@ static int apply_batch(xc_interface *xch, uint32_t dom,
struct restore_ctx *ctx,
             /* a bogus/unmapped page: skip it */
             continue;
 
-        if (pfn_err[i])
+        if ( (!callbacks || !callbacks->get_page) && pfn_err[i] )
         {
             ERROR("unexpected PFN mapping failure");
             goto err_mapped;
@@ -993,8 +998,20 @@ static int apply_batch(xc_interface *xch, uint32_t dom,
struct restore_ctx *ctx,
 
         mfn = ctx->p2m[pfn];
 
+        if ( callbacks && callbacks->get_page )
+        {
+            target_buf = callbacks->get_page(&callbacks->comm_data,
+                                             callbacks->data, pfn);
+            if ( !target_buf )
+            {
+                ERROR("Cannot get a buffer to store memory");
+                goto err_mapped;
+            }
+        }
+        else
+            target_buf = region_base + i*PAGE_SIZE;
         /* In verify mode, we use a copy; otherwise we work in place */
-        page = pagebuf->verify ? (void *)buf : (region_base + i*PAGE_SIZE);
+        page = pagebuf->verify ? (void *)buf : target_buf;
 
         memcpy(page, pagebuf->pages + (curpage + curbatch) * PAGE_SIZE,
PAGE_SIZE);
 
@@ -1038,27 +1055,26 @@ static int apply_batch(xc_interface *xch, uint32_t dom,
struct restore_ctx *ctx,
 
         if ( pagebuf->verify )
         {
-            int res = memcmp(buf, (region_base + i*PAGE_SIZE), PAGE_SIZE);
+            int res = memcmp(buf, target_buf, PAGE_SIZE);
             if ( res )
             {
                 int v;
 
                 DPRINTF("************** pfn=%lx type=%lx gotcs=%08lx
"
                         "actualcs=%08lx\n", pfn,
pagebuf->pfn_types[pfn],
-                        csum_page(region_base + (i + curbatch)*PAGE_SIZE),
+                        csum_page(target_buf),
                         csum_page(buf));
 
                 for ( v = 0; v < 4; v++ )
                 {
-                    unsigned long *p = (unsigned long *)
-                        (region_base + i*PAGE_SIZE);
+                    unsigned long *p = (unsigned long *)target_buf;
                     if ( buf[v] != p[v] )
                         DPRINTF("    %d: %08lx %08lx\n", v, buf[v],
p[v]);
                 }
             }
         }
 
-        if ( !hvm &&
+        if ( (!callbacks || !callbacks->get_page) && !hvm &&
              xc_add_mmu_update(xch, mmu,
                                (((unsigned long long)mfn) << PAGE_SHIFT)
                                | MMU_MACHPHYS_UPDATE, pfn) )
@@ -1071,8 +1087,11 @@ static int apply_batch(xc_interface *xch, uint32_t dom,
struct restore_ctx *ctx,
     rc = nraces;
 
   err_mapped:
-    munmap(region_base, j*PAGE_SIZE);
-    free(pfn_err);
+    if ( !callbacks || !callbacks->get_page )
+    {
+        munmap(region_base, j*PAGE_SIZE);
+        free(pfn_err);
+    }
 
     return rc;
 }
@@ -1080,7 +1099,8 @@ static int apply_batch(xc_interface *xch, uint32_t dom,
struct restore_ctx *ctx,
 int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       unsigned int store_evtchn, unsigned long *store_mfn,
                       unsigned int console_evtchn, unsigned long *console_mfn,
-                      unsigned int hvm, unsigned int pae, int superpages)
+                      unsigned int hvm, unsigned int pae, int superpages,
+                      struct restore_callbacks *callbacks)
 {
     DECLARE_DOMCTL;
     int rc = 1, frc, i, j, n, m, pae_extended_cr3 = 0, ext_vcpucontext = 0;
@@ -1141,6 +1161,9 @@ int xc_domain_restore(xc_interface *xch, int io_fd,
uint32_t dom,
     static struct restore_ctx *ctx = &_ctx;
     struct domain_info_context *dinfo = &ctx->dinfo;
 
+    struct restore_data *comm_data = NULL;
+    void *data = NULL;
+
     pagebuf_init(&pagebuf);
     memset(&tailbuf, 0, sizeof(tailbuf));
     tailbuf.ishvm = hvm;
@@ -1249,6 +1272,32 @@ int xc_domain_restore(xc_interface *xch, int io_fd,
uint32_t dom,
         goto out;
     }
 
+    /* init callbacks->comm_data */
+    if ( callbacks )
+    {
+        callbacks->comm_data.xch = xch;
+        callbacks->comm_data.dom = dom;
+        callbacks->comm_data.dinfo = dinfo;
+        callbacks->comm_data.hvm = hvm;
+        callbacks->comm_data.pfn_type = pfn_type;
+        callbacks->comm_data.mmu = mmu;
+        callbacks->comm_data.p2m_frame_list = p2m_frame_list;
+        callbacks->comm_data.p2m = ctx->p2m;
+        comm_data = &callbacks->comm_data;
+
+        /* init callbacks->data */
+        if ( callbacks->init)
+        {
+            callbacks->data = NULL;
+            if (callbacks->init(&callbacks->comm_data,
&callbacks->data) < 0 )
+            {
+                ERROR("Could not initialise restore callbacks private
data");
+                goto out;
+            }
+        }
+        data = callbacks->data;
+    }
+
     xc_report_progress_start(xch, "Reloading memory pages",
dinfo->p2m_size);
 
     /*
@@ -1298,7 +1347,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd,
uint32_t dom,
             int brc;
 
             brc = apply_batch(xch, dom, ctx, region_mfn, pfn_type,
-                              pae_extended_cr3, hvm, mmu, &pagebuf,
curbatch);
+                              pae_extended_cr3, hvm, mmu, &pagebuf,
curbatch,
+                              callbacks);
             if ( brc < 0 )
                 goto out;
 
@@ -1368,6 +1418,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd,
uint32_t dom,
         goto finish;
     }
 
+getpages:
     // DPRINTF("Buffered checkpoint\n");
 
     if ( pagebuf_get(xch, ctx, &pagebuf, io_fd, dom) ) {
@@ -1499,58 +1550,69 @@ int xc_domain_restore(xc_interface *xch, int io_fd,
uint32_t dom,
         }
     }
 
-    /*
-     * Pin page tables. Do this after writing to them as otherwise Xen
-     * will barf when doing the type-checking.
-     */
-    nr_pins = 0;
-    for ( i = 0; i < dinfo->p2m_size; i++ )
+    if ( callbacks && callbacks->flush_memory )
     {
-        if ( (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
-            continue;
-
-        switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        if ( callbacks->flush_memory(comm_data, data) < 0 )
         {
-        case XEN_DOMCTL_PFINFO_L1TAB:
-            pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
-            break;
+            ERROR("Error doing callbacks->flush_memory()");
+            goto out;
+        }
+    }
+    else
+    {
+        /*
+         * Pin page tables. Do this after writing to them as otherwise Xen
+         * will barf when doing the type-checking.
+         */
+        nr_pins = 0;
+        for ( i = 0; i < dinfo->p2m_size; i++ )
+        {
+            if ( (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+                continue;
 
-        case XEN_DOMCTL_PFINFO_L2TAB:
-            pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE;
-            break;
+            switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+            {
+            case XEN_DOMCTL_PFINFO_L1TAB:
+                pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
+                break;
 
-        case XEN_DOMCTL_PFINFO_L3TAB:
-            pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE;
-            break;
+            case XEN_DOMCTL_PFINFO_L2TAB:
+                pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE;
+                break;
 
-        case XEN_DOMCTL_PFINFO_L4TAB:
-            pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE;
-            break;
+            case XEN_DOMCTL_PFINFO_L3TAB:
+                pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE;
+                break;
 
-        default:
-            continue;
-        }
+            case XEN_DOMCTL_PFINFO_L4TAB:
+                pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE;
+                break;
+
+            default:
+                continue;
+            }
 
-        pin[nr_pins].arg1.mfn = ctx->p2m[i];
-        nr_pins++;
+            pin[nr_pins].arg1.mfn = ctx->p2m[i];
+            nr_pins++;
 
-        /* Batch full? Then flush. */
-        if ( nr_pins == MAX_PIN_BATCH )
-        {
-            if ( xc_mmuext_op(xch, pin, nr_pins, dom) < 0 )
+            /* Batch full? Then flush. */
+            if ( nr_pins == MAX_PIN_BATCH )
             {
-                PERROR("Failed to pin batch of %d page tables",
nr_pins);
-                goto out;
+                if ( xc_mmuext_op(xch, pin, nr_pins, dom) < 0 )
+                {
+                    PERROR("Failed to pin batch of %d page tables",
nr_pins);
+                    goto out;
+                }
+                nr_pins = 0;
             }
-            nr_pins = 0;
         }
-    }
 
-    /* Flush final partial batch. */
-    if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) <
0) )
-    {
-        PERROR("Failed to pin batch of %d page tables", nr_pins);
-        goto out;
+        /* Flush final partial batch. */
+        if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom)
< 0) )
+        {
+            PERROR("Failed to pin batch of %d page tables", nr_pins);
+            goto out;
+        }
     }
 
     DPRINTF("Memory reloaded (%ld pages)\n", ctx->nr_pfns);
@@ -1767,37 +1829,61 @@ int xc_domain_restore(xc_interface *xch, int io_fd,
uint32_t dom,
     /* leave wallclock time. set by hypervisor */
     munmap(new_shared_info, PAGE_SIZE);
 
-    /* Uncanonicalise the pfn-to-mfn table frame-number list. */
-    for ( i = 0; i < P2M_FL_ENTRIES; i++ )
+    if ( callbacks && callbacks->update_p2m )
     {
-        pfn = p2m_frame_list[i];
-        if ( (pfn >= dinfo->p2m_size) || (pfn_type[pfn] !=
XEN_DOMCTL_PFINFO_NOTAB) )
+        if ( callbacks->update_p2m(comm_data, data) < 0 )
         {
-            ERROR("PFN-to-MFN frame number %i (%#lx) is bad", i,
pfn);
+            ERROR("Error doing callbacks->update_p2m()");
             goto out;
         }
-        p2m_frame_list[i] = ctx->p2m[pfn];
     }
-
-    /* Copy the P2M we''ve constructed to the ''live''
P2M */
-    if ( !(ctx->live_p2m = xc_map_foreign_pages(xch, dom, PROT_WRITE,
-                                           p2m_frame_list, P2M_FL_ENTRIES)) )
+    else
     {
-        PERROR("Couldn''t map p2m table");
-        goto out;
+        /* Uncanonicalise the pfn-to-mfn table frame-number list. */
+        for ( i = 0; i < P2M_FL_ENTRIES; i++ )
+        {
+            pfn = p2m_frame_list[i];
+            if ( (pfn >= dinfo->p2m_size) || (pfn_type[pfn] !=
XEN_DOMCTL_PFINFO_NOTAB) )
+            {
+                ERROR("PFN-to-MFN frame number %i (%#lx) is bad", i,
pfn);
+                goto out;
+            }
+            p2m_frame_list[i] = ctx->p2m[pfn];
+        }
+
+        /* Copy the P2M we''ve constructed to the
''live'' P2M */
+        if ( !(ctx->live_p2m = xc_map_foreign_pages(xch, dom, PROT_WRITE,
+                                               p2m_frame_list, P2M_FL_ENTRIES))
)
+        {
+            PERROR("Couldn''t map p2m table");
+            goto out;
+        }
+
+        /* If the domain we''re restoring has a different word size to
ours,
+         * we need to adjust the live_p2m assignment appropriately */
+        if ( dinfo->guest_width > sizeof (xen_pfn_t) )
+            for ( i = dinfo->p2m_size - 1; i >= 0; i-- )
+                ((int64_t *)ctx->live_p2m)[i] = (long)ctx->p2m[i];
+        else if ( dinfo->guest_width < sizeof (xen_pfn_t) )
+            for ( i = 0; i < dinfo->p2m_size; i++ )
+                ((uint32_t *)ctx->live_p2m)[i] = ctx->p2m[i];
+        else
+            memcpy(ctx->live_p2m, ctx->p2m, dinfo->p2m_size *
sizeof(xen_pfn_t));
+        munmap(ctx->live_p2m, P2M_FL_ENTRIES * PAGE_SIZE);
     }
 
-    /* If the domain we''re restoring has a different word size to
ours,
-     * we need to adjust the live_p2m assignment appropriately */
-    if ( dinfo->guest_width > sizeof (xen_pfn_t) )
-        for ( i = dinfo->p2m_size - 1; i >= 0; i-- )
-            ((int64_t *)ctx->live_p2m)[i] = (long)ctx->p2m[i];
-    else if ( dinfo->guest_width < sizeof (xen_pfn_t) )
-        for ( i = 0; i < dinfo->p2m_size; i++ )   
-            ((uint32_t *)ctx->live_p2m)[i] = ctx->p2m[i];
-    else
-        memcpy(ctx->live_p2m, ctx->p2m, dinfo->p2m_size *
sizeof(xen_pfn_t));
-    munmap(ctx->live_p2m, P2M_FL_ENTRIES * PAGE_SIZE);
+    if ( callbacks && callbacks->finish_restotre )
+    {
+        rc = callbacks->finish_restotre(comm_data, data);
+        if ( rc == 1 )
+            goto getpages;
+
+        if ( rc < 0 )
+        {
+            ERROR("Er1ror doing callbacks->finish_restotre()");
+            goto out;
+        }
+    }
 
     DPRINTF("Domain ready to be built.\n");
     rc = 0;
@@ -1861,6 +1947,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd,
uint32_t dom,
     rc = 0;
 
  out:
+    if ( callbacks && callbacks->free && callbacks->data)
+        callbacks->free(&callbacks->comm_data, callbacks->data);
     if ( (rc != 0) && (dom != 0) )
         xc_domain_destroy(xch, dom);
     xc_hypercall_buffer_free(xch, ctxt);
diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
index 9ed0ea4..709a284 100644
--- a/tools/libxc/xenguest.h
+++ b/tools/libxc/xenguest.h
@@ -60,6 +60,57 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t
dom, uint32_t max_iter
                    struct save_callbacks* callbacks, int hvm);
 
 
+/* pass the variable defined in xc_domain_restore() to callback. Use
+ * this structure for the following purpose:
+ *   1. avoid too many arguments.
+ *   2. different callback implemention may need different arguments.
+ *      Just add the information you need here.
+ */
+struct restore_data
+{
+    xc_interface *xch;
+    uint32_t dom;
+    struct domain_info_context *dinfo;
+    int hvm;
+    unsigned long *pfn_type;
+    struct xc_mmu *mmu;
+    unsigned long *p2m_frame_list;
+    unsigned long *p2m;
+};
+
+/* callbacks provided by xc_domain_restore */
+struct restore_callbacks {
+    /* callback to init data */
+    int (*init)(struct restore_data *comm_data, void **data);
+    /* callback to free data */
+    void (*free)(struct restore_data *comm_data, void *data);
+    /* callback to get a buffer to store memory data that is transfered
+     * from the source machine.
+     */
+    char *(*get_page)(struct restore_data *comm_data, void *data,
+                     unsigned long pfn);
+    /* callback to flush memory that is transfered from the source machine
+     * to the guest. Update the guest''s pagetable if necessary.
+     */
+    int (*flush_memory)(struct restore_data *comm_data, void *data);
+    /* callback to update the guest''s p2m table */
+    int (*update_p2m)(struct restore_data *comm_data, void *data);
+    /* callback to finish restore process. It is called before
xc_domain_restore()
+     * returns.
+     *
+     * Return value:
+     *   -1: error
+     *    0: continue to start vm
+     *    1: continue to do a checkpoint
+     */
+    int (*finish_restotre)(struct restore_data *comm_data, void *data);
+
+    /* xc_domain_restore() init it */
+    struct restore_data comm_data;
+    /* to be provided as the last argument to each callback function */
+    void* data;
+};
+
 /**
  * This function will restore a saved domain.
  *
@@ -76,7 +127,8 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t
dom, uint32_t max_iter
 int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       unsigned int store_evtchn, unsigned long *store_mfn,
                       unsigned int console_evtchn, unsigned long *console_mfn,
-                      unsigned int hvm, unsigned int pae, int superpages);
+                      unsigned int hvm, unsigned int pae, int superpages,
+                      struct restore_callbacks *callbacks);
 /**
  * xc_domain_restore writes a file to disk that contains the device
  * model saved state.
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index c702cf7..32cdd03 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -305,7 +305,7 @@ int libxl__domain_restore_common(libxl_ctx *ctx, uint32_t
domid,
     rc = xc_domain_restore(ctx->xch, fd, domid,
                              state->store_port, &state->store_mfn,
                              state->console_port,
&state->console_mfn,
-                             info->hvm, info->u.hvm.pae, 0);
+                             info->hvm, info->u.hvm.pae, 0, NULL);
     if ( rc ) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "restoring domain");
         return ERROR_FAIL;
diff --git a/tools/xcutils/xc_restore.c b/tools/xcutils/xc_restore.c
index ea069ac..8af88e4 100644
--- a/tools/xcutils/xc_restore.c
+++ b/tools/xcutils/xc_restore.c
@@ -46,7 +46,8 @@ main(int argc, char **argv)
 	    superpages = 0;
 
     ret = xc_domain_restore(xch, io_fd, domid, store_evtchn, &store_mfn,
-                            console_evtchn, &console_mfn, hvm, pae,
superpages);
+                            console_evtchn, &console_mfn, hvm, pae,
superpages,
+                            NULL);
 
     if ( ret == 0 )
     {
-- 
1.8.0

Wen Congyang

2013-Apr-03 08:02 UTC

head link

[RFC PATCH 3/7] colo: implement restore_callbacks

This patch implements restore callbacks for colo:
1. init(): allocate some memory
2. free(): free the memory allocated in init()
3. get_page(): We have cache the whole memory, so just return the buffer.
               This page is also marked as dirty.
4. flush_memory():
        We update the memory as the following:
        a. pin non-dirty L1 pagetables
        b. unpin pagetables execpt non-dirty L1
        c. update the memory
        d. pin page tables
        e. unpin non-dirty L1 pagetables
5. update_p2m(): Just update the dirty pages which store p2m.
6. finish_store():
        We run xc_restore in XendCheckpoint.py. We communicate with
        XendCheckpoint.py like this:
        a. write "finish\n" to stdout when we are ready to resume the
vm.
        b. XendCheckpoint.py writes "resume\n" when the vm is resumed
        c. write "resume\n" to stdout when postresume is done
        d. XendCheckpoint.py writes "suspend\n" when a new checkpoint
begins
        e. write "suspend\n" to stdout when the vm is suspended
        f. XendCheckpoint.py writes "start\n" when primary begins to
transfer
           dirty pages.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/Makefile                 |   4 +-
 tools/libxc/xc_domain_restore_colo.c | 740 +++++++++++++++++++++++++++++++++++
 tools/libxc/xc_domain_save.c         |  34 +-
 tools/libxc/xc_save_restore_colo.h   |  44 +++
 xen/include/public/xen.h             |   1 +
 5 files changed, 788 insertions(+), 35 deletions(-)
 create mode 100644 tools/libxc/xc_domain_restore_colo.c
 create mode 100644 tools/libxc/xc_save_restore_colo.h

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 5a7677e..e2d059d 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -42,12 +42,12 @@ CTRL_SRCS-$(CONFIG_MiniOS) += xc_minios.c
 
 GUEST_SRCS-y : GUEST_SRCS-y += xg_private.c xc_suspend.c
-GUEST_SRCS-$(CONFIG_MIGRATE) += xc_domain_restore.c xc_domain_save.c
+GUEST_SRCS-$(CONFIG_MIGRATE) += xc_domain_restore.c xc_domain_save.c
xc_domain_restore_colo.c
 GUEST_SRCS-$(CONFIG_MIGRATE) += xc_offline_page.c
 GUEST_SRCS-$(CONFIG_HVM) += xc_hvm_build.c
 
 vpath %.c ../../xen/common/libelf
-CFLAGS += -I../../xen/common/libelf
+CFLAGS += -I../../xen/common/libelf -I../xenstore
 
 GUEST_SRCS-y += libelf-tools.c libelf-loader.c
 GUEST_SRCS-y += libelf-dominfo.c libelf-relocate.c
diff --git a/tools/libxc/xc_domain_restore_colo.c
b/tools/libxc/xc_domain_restore_colo.c
new file mode 100644
index 0000000..ffc7daa
--- /dev/null
+++ b/tools/libxc/xc_domain_restore_colo.c
@@ -0,0 +1,740 @@
+#include <xc_save_restore_colo.h>
+#include <xs.h>
+
+struct restore_colo_data
+{
+    /* store the pfn type on slave side */
+    unsigned long *pfn_type_slaver;
+    unsigned long max_mem_pfn;
+
+    /* cache the whole memory */
+    char* pagebase;
+
+    /* which page is dirty? */
+    unsigned long *dirty_pages;
+
+    /* suspend evtchn */
+    int local_port;
+
+    xc_evtchn *xce;
+
+    /* temp buffer(avoid malloc/free frequently) */
+    unsigned long *pfn_batch_slaver;
+    unsigned long *pfn_type_batch_slaver;
+    unsigned long *p2m_frame_list_temp;
+
+    int first_time;
+};
+
+/* we restore only one vm in a process, so it is same to use global variable */
+DECLARE_HYPERCALL_BUFFER(unsigned long, dirty_pages);
+
+int restore_colo_init(struct restore_data *comm_data, void **data)
+{
+    xc_dominfo_t info;
+    int i;
+    unsigned long size;
+    xc_interface *xch = comm_data->xch;
+    struct restore_colo_data *colo_data;
+    struct domain_info_context *dinfo = comm_data->dinfo;
+
+    if (comm_data->hvm)
+        /* hvm is unsupported now */
+        return -1;
+
+    if (dirty_pages)
+        /* restore_colo_init() is called more than once?? */
+        return -1;
+
+    colo_data = calloc(1, sizeof(struct restore_colo_data));
+    if (!colo_data)
+        return -1;
+
+    if (xc_domain_getinfo(xch, comm_data->dom, 1, &info) != 1)
+    {
+        PERROR("Could not get domain info");
+        goto err;
+    }
+
+    colo_data->max_mem_pfn = info.max_memkb >> (PAGE_SHIFT - 10);
+
+    colo_data->pfn_type_slaver = calloc(dinfo->p2m_size,
sizeof(xen_pfn_t));
+    colo_data->pfn_batch_slaver = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t));
+    colo_data->pfn_type_batch_slaver = calloc(MAX_BATCH_SIZE,
sizeof(xen_pfn_t));
+    colo_data->p2m_frame_list_temp = malloc(P2M_FL_ENTRIES);
+
+    dirty_pages = xc_hypercall_buffer_alloc_pages(xch, dirty_pages,
NRPAGES(BITMAP_SIZE));
+    colo_data->dirty_pages = dirty_pages;
+
+    size = dinfo->p2m_size * PAGE_SIZE;
+    colo_data->pagebase = malloc(size);
+    if (colo_data->pfn_type_slaver || colo_data->pfn_batch_slaver ||
+        colo_data->pfn_type_batch_slaver ||
colo_data->p2m_frame_list_temp ||
+        colo_data->dirty_pages || colo_data->pagebase) {
+        PERROR("Could not allocate memory for restore colo data");
+        goto err;
+    }
+
+    colo_data->xce = xc_evtchn_open(NULL, 0);
+    if (!colo_data->xce) {
+        PERROR("Could not open evtchn");
+        goto err;
+    }
+
+    for (i = 0; i < dinfo->p2m_size; i++)
+        comm_data->pfn_type[i] = XEN_DOMCTL_PFINFO_XTAB;
+    memset(dirty_pages, 0xff, BITMAP_SIZE);
+    colo_data->first_time = 1;
+    colo_data->local_port = -1;
+    *data = colo_data;
+
+    return 0;
+
+err:
+    restore_colo_free(comm_data, colo_data);
+    *data = NULL;
+    return -1;
+}
+
+void restore_colo_free(struct restore_data *comm_data, void *data)
+{
+    struct restore_colo_data *colo_data = data;
+
+    if (!colo_data)
+        return;
+
+    free(colo_data->pfn_type_slaver);
+    free(colo_data->pagebase);
+    free(colo_data->pfn_batch_slaver);
+    free(colo_data->pfn_type_batch_slaver);
+    free(colo_data->p2m_frame_list_temp);
+    if (dirty_pages)
+        xc_hypercall_buffer_free(comm_data->xch, dirty_pages);
+    if (colo_data->xce)
+        xc_evtchn_close(colo_data->xce);
+    free(colo_data);
+}
+
+char* get_page(struct restore_data *comm_data, void *data,
+               unsigned long pfn)
+{
+    struct restore_colo_data *colo_data = data;
+
+    set_bit(pfn, colo_data->dirty_pages);
+    return colo_data->pagebase + pfn * PAGE_SIZE;
+}
+
+/* Step1: pin non-dirty L1 pagetables: ~dirty_pages & mL1 (= ~dirty_pages
& sL1) */
+static int pin_l1(struct restore_data *comm_data,
+                  struct restore_colo_data *colo_data)
+{
+    unsigned int nr_pins = 0;
+    unsigned long i;
+    struct mmuext_op pin[MAX_PIN_BATCH];
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    unsigned long *pfn_type = comm_data->pfn_type;
+    uint32_t dom = comm_data->dom;
+    xc_interface *xch = comm_data->xch;
+    unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+
+    for (i = 0; i < dinfo->p2m_size; i++)
+    {
+        switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            if (pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LPINTAB)
+                /* don''t pin already pined */
+                continue;
+
+            if (test_bit(i, dirty_pages))
+                /* don''t pin dirty */
+                continue;
+
+            /* here, it must also be L1 in slaver, otherwise it is dirty.
+             * (add test code ?)
+             */
+            pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+        case XEN_DOMCTL_PFINFO_L3TAB:
+        case XEN_DOMCTL_PFINFO_L4TAB:
+        default:
+            continue;
+        }
+
+        pin[nr_pins].arg1.mfn = comm_data->p2m[i];
+        nr_pins++;
+
+        /* Batch full? Then flush. */
+        if (nr_pins == MAX_PIN_BATCH)
+        {
+            if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)
+            {
+                PERROR("Failed to pin L1 batch of %d page tables",
nr_pins);
+                return 1;
+            }
+            nr_pins = 0;
+        }
+    }
+
+    /* Flush final partial batch. */
+    if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) <
0))
+    {
+        PERROR("Failed to pin L1 batch of %d page tables", nr_pins);
+        return 1;
+    }
+
+    return 0;
+}
+
+/* Step2: unpin pagetables execpt non-dirty L1: sL2 + sL3 + sL4 + (dirty_pages
& sL1) */
+static int unpin_pagetable(struct restore_data *comm_data,
+                           struct restore_colo_data *colo_data)
+{
+    unsigned int nr_pins = 0;
+    unsigned long i;
+    struct mmuext_op pin[MAX_PIN_BATCH];
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    uint32_t dom = comm_data->dom;
+    xc_interface *xch = comm_data->xch;
+    unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+
+    for (i = 0; i < dinfo->p2m_size; i++)
+    {
+        if ( (pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        switch ( pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            if (!test_bit(i, dirty_pages)) // it is in (~dirty_pages &
mL1), keep it
+                continue;
+            // fallthrough
+        case XEN_DOMCTL_PFINFO_L2TAB:
+        case XEN_DOMCTL_PFINFO_L3TAB:
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            pin[nr_pins].cmd = MMUEXT_UNPIN_TABLE;
+            break;
+
+        default:
+            continue;
+        }
+
+        pin[nr_pins].arg1.mfn = comm_data->p2m[i];
+        nr_pins++;
+
+        /* Batch full? Then flush. */
+        if (nr_pins == MAX_PIN_BATCH)
+        {
+            if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)
+            {
+                PERROR("Failed to unpin batch of %d page tables",
nr_pins);
+                return 1;
+            }
+            nr_pins = 0;
+        }
+    }
+
+    /* Flush final partial batch. */
+    if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) <
0))
+    {
+        PERROR("Failed to unpin batch of %d page tables", nr_pins);
+        return 1;
+    }
+
+    return 0;
+}
+
+/* we have unpined all pagetables except non-diry l1. So it is OK to map the
dirty memory
+ * and update it.
+ */
+static int update_memory(struct restore_data *comm_data,
+                         struct restore_colo_data *colo_data)
+{
+    unsigned long pfn;
+    unsigned long max_mem_pfn = colo_data->max_mem_pfn;
+    unsigned long *pfn_type = comm_data->pfn_type;
+    unsigned long pagetype;
+    uint32_t dom = comm_data->dom;
+    xc_interface *xch = comm_data->xch;
+    int hvm = comm_data->hvm;
+    struct xc_mmu *mmu = comm_data->mmu;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+    char *pagebase = colo_data->pagebase;
+    int pfn_err = 0;
+    char *region_base_slaver;
+    xen_pfn_t region_mfn_slaver;
+    unsigned long mfn;
+    char *pagebuff;
+
+    for (pfn = 0; pfn < max_mem_pfn; pfn++) {
+        if ( !test_bit(pfn, dirty_pages) )
+            continue;
+
+        pagetype = pfn_type[pfn] & XEN_DOMCTL_PFINFO_LTAB_MASK;
+        if (pagetype == XEN_DOMCTL_PFINFO_XTAB)
+            /* a bogus/unmapped page: skip it */
+            continue;
+
+        mfn = comm_data->p2m[pfn];
+        region_mfn_slaver = mfn;
+        region_base_slaver = xc_map_foreign_bulk(xch, dom,
+                                    PROT_WRITE, &region_mfn_slaver,
&pfn_err, 1);
+        if (!region_base_slaver || pfn_err) {
+            PERROR("update_memory: xc_map_foreign_bulk failed");
+            return 1;
+        }
+
+        pagebuff = (char *)(pagebase + pfn * PAGE_SIZE);
+        memcpy(region_base_slaver, pagebuff, PAGE_SIZE);
+        munmap(region_base_slaver, PAGE_SIZE);
+
+        if (!hvm &&
+            xc_add_mmu_update(xch, mmu,
+                (((unsigned long long)mfn) << PAGE_SHIFT)
+                | MMU_MACHPHYS_UPDATE, pfn) )
+        {
+            PERROR("failed machpys update mfn=%lx pfn=%lx", mfn,
pfn);
+            return 1;
+        }
+    }
+
+    /*
+     * Ensure we flush all machphys updates before potential PAE-specific
+     * reallocations below.
+     */
+    if (!hvm && xc_flush_mmu_updates(xch, mmu))
+    {
+        PERROR("Error doing flush_mmu_updates()");
+        return 1;
+    }
+
+    return 0;
+}
+
+/* Step 4: pin master pt
+ * Pin page tables. Do this after writing to them as otherwise Xen
+ * will barf when doing the type-checking.
+ */
+static int pin_pagetable(struct restore_data *comm_data,
+                         struct restore_colo_data *colo_data)
+{
+    unsigned int nr_pins = 0;
+    unsigned long i;
+    struct mmuext_op pin[MAX_PIN_BATCH];
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    unsigned long *pfn_type = comm_data->pfn_type;
+    uint32_t dom = comm_data->dom;
+    xc_interface *xch = comm_data->xch;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+
+    for ( i = 0; i < dinfo->p2m_size; i++ )
+    {
+        if ( (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            if (!test_bit(i, dirty_pages))
+                /* it is in (~dirty_pages & mL1)(=~dirty_pages & sL1),
+                 * already pined
+                 */
+                continue;
+
+            pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE;
+            break;
+
+        default:
+            continue;
+        }
+
+        pin[nr_pins].arg1.mfn = comm_data->p2m[i];
+        nr_pins++;
+
+        /* Batch full? Then flush. */
+        if (nr_pins == MAX_PIN_BATCH)
+        {
+            if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)
+            {
+                PERROR("Failed to pin batch of %d page tables",
nr_pins);
+                return 1;
+            }
+            nr_pins = 0;
+        }
+    }
+
+    /* Flush final partial batch. */
+    if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) <
0))
+    {
+        PERROR("Failed to pin batch of %d page tables", nr_pins);
+        return 1;
+    }
+
+    return 0;
+}
+
+/* Step5: unpin unneeded non-dirty L1 pagetables: ~dirty_pages & mL1 (=
~dirty_pages & sL1) */
+static int unpin_l1(struct restore_data *comm_data,
+                    struct restore_colo_data *colo_data)
+{
+    unsigned int nr_pins = 0;
+    unsigned long i;
+    struct mmuext_op pin[MAX_PIN_BATCH];
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    unsigned long *pfn_type = comm_data->pfn_type;
+    uint32_t dom = comm_data->dom;
+    xc_interface *xch = comm_data->xch;
+    unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+
+    for (i = 0; i < dinfo->p2m_size; i++)
+    {
+        switch ( pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            if (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) // still needed
+                continue;
+            if (test_bit(i, dirty_pages)) // not pined by step 1
+                continue;
+
+            pin[nr_pins].cmd = MMUEXT_UNPIN_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+        case XEN_DOMCTL_PFINFO_L3TAB:
+        case XEN_DOMCTL_PFINFO_L4TAB:
+        default:
+            continue;
+        }
+
+        pin[nr_pins].arg1.mfn = comm_data->p2m[i];
+        nr_pins++;
+
+        /* Batch full? Then flush. */
+        if (nr_pins == MAX_PIN_BATCH)
+        {
+            if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)
+            {
+                PERROR("Failed to pin L1 batch of %d page tables",
nr_pins);
+                return 1;
+            }
+            nr_pins = 0;
+        }
+    }
+
+    /* Flush final partial batch. */
+    if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) <
0))
+    {
+        PERROR("Failed to pin L1 batch of %d page tables", nr_pins);
+        return 1;
+    }
+
+    return 0;
+}
+
+int flush_memory(struct restore_data *comm_data, void *data)
+{
+    struct restore_colo_data *colo_data = data;
+
+    if (pin_l1(comm_data, colo_data) != 0)
+        return -1;
+    if (unpin_pagetable(comm_data, colo_data) != 0)
+        return -1;
+    if (update_memory(comm_data, colo_data) != 0)
+        return -1;
+    if (pin_pagetable(comm_data, colo_data) != 0)
+        return -1;
+    if (unpin_l1(comm_data, colo_data) != 0)
+        return -1;
+
+    memcpy(colo_data->pfn_type_slaver, comm_data->pfn_type,
+           comm_data->dinfo->p2m_size * sizeof(xen_pfn_t));
+
+    return 0;
+}
+
+int update_p2m_table(struct restore_data *comm_data, void *data)
+{
+    struct restore_colo_data *colo_data = data;
+    unsigned long i, j, n, pfn;
+    unsigned long *p2m_frame_list = comm_data->p2m_frame_list;
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    unsigned long *pfn_type = comm_data->pfn_type;
+    xc_interface *xch = comm_data->xch;
+    uint32_t dom = comm_data->dom;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+    unsigned long *p2m_frame_list_temp = colo_data->p2m_frame_list_temp;
+
+    /* A temporay mapping of the guest''s p2m table(all dirty pages) */
+    xen_pfn_t *live_p2m;
+    /* A temporay mapping of the guest''s p2m table(1 page) */
+    xen_pfn_t *live_p2m_one;
+    unsigned long *p2m;
+
+    j = 0;
+    for (i = 0; i < P2M_FL_ENTRIES; i++)
+    {
+        pfn = p2m_frame_list[i];
+        if ((pfn >= dinfo->p2m_size) || (pfn_type[pfn] !=
XEN_DOMCTL_PFINFO_NOTAB))
+        {
+            ERROR("PFN-to-MFN frame number %i (%#lx) is bad", i,
pfn);
+            return -1;
+        }
+
+        if (!test_bit(pfn, dirty_pages))
+            continue;
+
+        p2m_frame_list_temp[j++] = comm_data->p2m[pfn];
+    }
+
+    if (j)
+    {
+        /* Copy the P2M we''ve constructed to the
''live'' P2M */
+        if (!(live_p2m = xc_map_foreign_pages(xch, dom, PROT_WRITE,
+                                              p2m_frame_list_temp, j)))
+        {
+            PERROR("Couldn''t map p2m table");
+            return -1;
+        }
+
+        j = 0;
+        for (i = 0; i < P2M_FL_ENTRIES; i++)
+        {
+            pfn = p2m_frame_list[i];
+            if (!test_bit(pfn, dirty_pages))
+                continue;
+
+            live_p2m_one = (xen_pfn_t *)((char *)live_p2m + PAGE_SIZE * j++);
+            /* If the domain we''re restoring has a different word size
to ours,
+             * we need to adjust the live_p2m assignment appropriately */
+            if (dinfo->guest_width > sizeof (xen_pfn_t))
+            {
+                n = (i + 1) * FPP - 1;
+                for (i = FPP - 1; i >= 0; i--)
+                    ((uint64_t *)live_p2m_one)[i] =
(long)comm_data->p2m[n--];
+            }
+            else if (dinfo->guest_width < sizeof (xen_pfn_t))
+            {
+                n = i * FPP;
+                for (i = 0; i < FPP; i++)
+                    ((uint32_t *)live_p2m_one)[i] = comm_data->p2m[n++];
+            }
+            else
+            {
+                p2m = (xen_pfn_t *)((char *)comm_data->p2m + PAGE_SIZE * i);
+                memcpy(live_p2m_one, p2m, PAGE_SIZE);
+            }
+        }
+        munmap(live_p2m, j * PAGE_SIZE);
+    }
+
+    return 0;
+}
+
+static int update_pfn_type(xc_interface *xch, uint32_t dom, int count,
xen_pfn_t *pfn_batch,
+   xen_pfn_t *pfn_type_batch, xen_pfn_t *pfn_type)
+{
+    unsigned long k;
+
+    if (xc_get_pfn_type_batch(xch, dom, count, pfn_type_batch))
+    {
+        ERROR("xc_get_pfn_type_batch for slaver failed");
+        return -1;
+    }
+
+    for (k = 0; k < count; k++)
+        pfn_type[pfn_batch[k]] = pfn_type_batch[k] &
XEN_DOMCTL_PFINFO_LTAB_MASK;
+
+    return 0;
+}
+
+/* we are ready to start the guest when this functions is called. We
+ * will return until we need to do a new checkpoint or some error occurs.
+ *
+ * communication with python
+ * python code                  restore code        comment
+ *                  <====       "finish\n"
+ * "resume\n"       ====>                           guest is
resumed
+ *                  <====       "resume\n"          postresume is
done
+ * "suspend\n"      ====>                           a new
checkpoint begins
+ *                  <====       "suspend\n"         guest is
suspended
+ * "start\n"        ====>                           getting dirty
pages begins
+ *
+ * return value:
+ * -1: error
+ *  0: continue to start vm
+ *  1: continue to do a checkpoint
+ */
+int finish_colo(struct restore_data *comm_data, void *data)
+{
+    struct restore_colo_data *colo_data = data;
+    xc_interface *xch = comm_data->xch;
+    uint32_t dom = comm_data->dom;
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    xc_evtchn *xce = colo_data->xce;
+    unsigned long *pfn_batch_slaver = colo_data->pfn_batch_slaver;
+    unsigned long *pfn_type_batch_slaver = colo_data->pfn_type_batch_slaver;
+    unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver;
+    DECLARE_HYPERCALL;
+
+    unsigned long i, j;
+    int rc;
+    char str[10];
+    int remote_port;
+    int local_port = colo_data->local_port;
+
+#if 0
+    /* output the store-mfn & console-mfn */
+    printf("store-mfn %li\n", *store_mfn);
+    printf("console-mfn %li\n", *console_mfn);
+#endif
+
+    /* we need to know which pages are dirty to restore the guest */
+    if (xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY, NULL,
+                          0, NULL, 0, NULL) < 0 )
+    {
+        rc = xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_OFF, NULL, 0,
+                               NULL, 0, NULL);
+        if (rc >= 0)
+        {
+            rc = xc_shadow_control(xch, dom,
+                                   XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY, NULL,
+                                   0, NULL, 0, NULL);
+        }
+        if (rc < 0)
+        {
+            ERROR("enabling logdirty fails");
+            return -1;
+        }
+    }
+
+    /* notify python code checkpoint finish */
+    printf("finish\n");
+    fflush(stdout);
+
+    /* wait domain resume, then connect the suspend evtchn */
+    scanf("%s", str);
+
+    if (colo_data->first_time) {
+        sleep(10);
+        remote_port = xs_suspend_evtchn_port(dom);
+        if (remote_port < 0) {
+            ERROR("getting remote suspend port fails");
+            return -1;
+        }
+
+        local_port = xc_suspend_evtchn_init(xch, xce, dom, remote_port);
+        if (local_port < 0) {
+            ERROR("initializing suspend evtchn fails");
+            return -1;
+        }
+
+        colo_data->local_port = local_port;
+    }
+
+    /* notify python code vm is resumed */
+    printf("resume\n");
+    fflush(stdout);
+
+    /* wait for the next checkpoint */
+    scanf("%s", str);
+    if (strcmp(str, "suspend"))
+    {
+        ERROR("wait for a new checkpoint fails");
+        /* start the guest now? */
+        return 0;
+    }
+
+    /* notify the suspend evtchn */
+    rc = xc_evtchn_notify(xce, local_port);
+    if (rc < 0)
+    {
+        ERROR("notifying the suspend evtchn fails");
+        return -1;
+    }
+
+    rc = xc_await_suspend(xch, xce, local_port);
+    if (rc < 0)
+    {
+        ERROR("waiting suspend fails");
+        return -1;
+    }
+
+    /* notify python code suspend is done */
+    printf("suspend\n");
+    fflush(stdout);
+
+    scanf("%s", str);
+
+    if (strcmp(str, "start"))
+        return -1;
+
+    memset(colo_data->dirty_pages, 0x0, BITMAP_SIZE);
+    if (xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_CLEAN,
+                          HYPERCALL_BUFFER(dirty_pages), dinfo->p2m_size,
+                          NULL, 0, NULL) != dinfo->p2m_size)
+    {
+        ERROR("getting slaver dirty fails");
+        return -1;
+    }
+
+    if (xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_OFF, NULL, 0, NULL,
+                          0, NULL) < 0 )
+    {
+        ERROR("disabling dirty-log fails");
+        return -1;
+    }
+
+    j = 0;
+    for (i = 0; i < colo_data->max_mem_pfn; i++)
+    {
+        if ( !test_bit(i, colo_data->dirty_pages) )
+            continue;
+
+        pfn_batch_slaver[j] = i;
+        pfn_type_batch_slaver[j++] = comm_data->p2m[i];
+        if (j == MAX_BATCH_SIZE)
+        {
+            if (update_pfn_type(xch, dom, j, pfn_batch_slaver,
+                                pfn_type_batch_slaver, pfn_type_slaver))
+            {
+                return -1;
+            }
+            j = 0;
+        }
+    }
+
+    if (j)
+    {
+        if (update_pfn_type(xch, dom, j, pfn_batch_slaver,
+                            pfn_type_batch_slaver, pfn_type_slaver))
+        {
+            return -1;
+        }
+    }
+
+    /* reset memory */
+    hypercall.op = __HYPERVISOR_reset_memory_op;
+    hypercall.arg[0] = (unsigned long)dom;
+    do_xen_hypercall(xch, &hypercall);
+
+    colo_data->first_time = 0;
+
+    return 1;
+}
diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index 222aa03..3aafa61 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -28,8 +28,7 @@
 
 #include "xc_private.h"
 #include "xc_dom.h"
-#include "xg_private.h"
-#include "xg_save_restore.h"
+#include "xc_save_restore_colo.h"
 
 #include <xen/hvm/params.h>
 #include "xc_e820.h"
@@ -82,37 +81,6 @@ struct outbuf {
      ((mfn_to_pfn(_mfn) < (dinfo->p2m_size)) &&   \
       (pfn_to_mfn(mfn_to_pfn(_mfn)) == (_mfn))))
 
-/*
-** During (live) save/migrate, we maintain a number of bitmaps to track
-** which pages we have to send, to fixup, and to skip.
-*/
-
-#define BITS_PER_LONG (sizeof(unsigned long) * 8)
-#define BITS_TO_LONGS(bits) (((bits)+BITS_PER_LONG-1)/BITS_PER_LONG)
-#define BITMAP_SIZE   (BITS_TO_LONGS(dinfo->p2m_size) * sizeof(unsigned
long))
-
-#define BITMAP_ENTRY(_nr,_bmap) \
-   ((volatile unsigned long *)(_bmap))[(_nr)/BITS_PER_LONG]
-
-#define BITMAP_SHIFT(_nr) ((_nr) % BITS_PER_LONG)
-
-#define ORDER_LONG (sizeof(unsigned long) == 4 ? 5 : 6)
-
-static inline int test_bit (int nr, volatile void * addr)
-{
-    return (BITMAP_ENTRY(nr, addr) >> BITMAP_SHIFT(nr)) & 1;
-}
-
-static inline void clear_bit (int nr, volatile void * addr)
-{
-    BITMAP_ENTRY(nr, addr) &= ~(1UL << BITMAP_SHIFT(nr));
-}
-
-static inline void set_bit ( int nr, volatile void * addr)
-{
-    BITMAP_ENTRY(nr, addr) |= (1UL << BITMAP_SHIFT(nr));
-}
-
 /* Returns the hamming weight (i.e. the number of bits set) in a N-bit word */
 static inline unsigned int hweight32(unsigned int w)
 {
diff --git a/tools/libxc/xc_save_restore_colo.h
b/tools/libxc/xc_save_restore_colo.h
new file mode 100644
index 0000000..1283c9c
--- /dev/null
+++ b/tools/libxc/xc_save_restore_colo.h
@@ -0,0 +1,44 @@
+#ifndef XC_SAVE_RESTORE_COLO_H
+#define XC_SAVE_RESTORE_COLO_H
+
+#include <xg_save_restore.h>
+#include <xg_private.h>
+
+extern int restore_colo_init(struct restore_data *, void **);
+extern void restore_colo_free(struct restore_data *, void *);
+extern char* get_page(struct restore_data *, void *, unsigned long);
+extern int flush_memory(struct restore_data *, void *);
+extern int update_p2m_table(struct restore_data *, void *);
+extern int finish_colo(struct restore_data *, void *);
+
+/*
+** During (live) save/migrate, we maintain a number of bitmaps to track
+** which pages we have to send, to fixup, and to skip.
+*/
+
+#define BITS_PER_LONG (sizeof(unsigned long) * 8)
+#define BITS_TO_LONGS(bits) (((bits)+BITS_PER_LONG-1)/BITS_PER_LONG)
+#define BITMAP_SIZE   (BITS_TO_LONGS(dinfo->p2m_size) * sizeof(unsigned
long))
+
+#define BITMAP_ENTRY(_nr,_bmap) \
+   ((volatile unsigned long *)(_bmap))[(_nr)/BITS_PER_LONG]
+
+#define BITMAP_SHIFT(_nr) ((_nr) % BITS_PER_LONG)
+
+#define ORDER_LONG (sizeof(unsigned long) == 4 ? 5 : 6)
+
+static inline int test_bit (int nr, volatile void * addr)
+{
+    return (BITMAP_ENTRY(nr, addr) >> BITMAP_SHIFT(nr)) & 1;
+}
+
+static inline void clear_bit (int nr, volatile void * addr)
+{
+    BITMAP_ENTRY(nr, addr) &= ~(1UL << BITMAP_SHIFT(nr));
+}
+
+static inline void set_bit ( int nr, volatile void * addr)
+{
+    BITMAP_ENTRY(nr, addr) |= (1UL << BITMAP_SHIFT(nr));
+}
+#endif
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index 93c3fe3..d7ee050 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -93,6 +93,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
 #define __HYPERVISOR_domctl               36
 #define __HYPERVISOR_kexec_op             37
 #define __HYPERVISOR_tmem_op              38
+#define __HYPERVISOR_reset_memory_op      40
 
 /* Architecture-specific hypercall definitions. */
 #define __HYPERVISOR_arch_0               48
-- 
1.8.0

Wen Congyang

2013-Apr-03 08:02 UTC

head link

[RFC PATCH 4/7] xc_domain_save: flush cache before calling callbacks->postcopy()

callbacks->postcopy() may use the fd to transfer something to the
other end, so we should flush cache before calling callbacks->postcopy()

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_domain_save.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index 3aafa61..cc4004a 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1886,9 +1886,6 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t
dom, uint32_t max_iter
  out:
     completed = 1;
 
-    if ( !rc && callbacks->postcopy )
-        callbacks->postcopy(callbacks->data);
-
     /* Flush last write and discard cache for file. */
     if ( outbuf_flush(xch, &ob, io_fd) < 0 ) {
         PERROR("Error when flushing output buffer");
@@ -1897,6 +1894,9 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t
dom, uint32_t max_iter
 
     discard_file_cache(xch, io_fd, 1 /* flush */);
 
+    if ( !rc && callbacks->postcopy )
+        callbacks->postcopy(callbacks->data);
+
     /* checkpoint_cb can spend arbitrarily long in between rounds */
     if (!rc && callbacks->checkpoint &&
         callbacks->checkpoint(callbacks->data) > 0)
-- 
1.8.0

Wen Congyang

2013-Apr-03 08:02 UTC

head link

[RFC PATCH 5/7] xc_domain_save: implement save_callbacks for colo

Add a new save callbacks:
1. post_sendstate(): SVM will run only when XC_SAVE_ID_LAST_CHECKPOINT is
   sent to slaver. But we only sent XC_SAVE_ID_LAST_CHECKPOINT when we do
   live migration now. Add this callback, and we can send it in this
   callback.

Update some callbacks for colo:
1. suspend(): In colo mode, both PVM and SVM are running. So we should suspend
        both PVM and SVM.
        Communicate with slaver like this:
        a. write "continue" to notify slaver to suspend SVM
        b. suspend PVM and SVM
        c. slaver writes "suspend" to tell master that SVM is
suspended
2. postcopy(): In colo mode, both PVM and SVM are running, and we have suspended
        both PVM and SVM. So we should resume PVM and SVM
        Communicate with slaver like this:
        a. write "resume" to notify slaver to resume SVM
        b. resume PVM and SVM
        c. slaver writes "resume" to tell master that SVM is resumed
3. checkpoint(): In colo mode, we do a new checkpoint only when output packet
    from PVM and SVM is different. We will block in this callback and return
    when a output packet is different.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

---
 tools/libxc/xc_domain_save.c                      |   9 +
 tools/libxc/xenguest.h                            |   3 +
 tools/python/xen/lowlevel/checkpoint/checkpoint.c | 289 +++++++++++++++++++++-
 tools/python/xen/lowlevel/checkpoint/checkpoint.h |   2 +
 4 files changed, 299 insertions(+), 4 deletions(-)

diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index cc4004a..870fea5 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1645,6 +1645,15 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t
dom, uint32_t max_iter
         }
     }
 
+    if ( callbacks->post_sendstate )
+    {
+        if ( callbacks->post_sendstate(callbacks->data) < 0)
+        {
+            PERROR("Error: post_sendstate()\n");
+            goto out;
+        }
+    }
+
     /* Zero terminate */
     i = 0;
     if ( wrexact(io_fd, &i, sizeof(int)) )
diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
index 709a284..04d2aaf 100644
--- a/tools/libxc/xenguest.h
+++ b/tools/libxc/xenguest.h
@@ -43,6 +43,9 @@ struct save_callbacks {
     /* Enable qemu-dm logging dirty pages to xen */
     int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM
only */
 
+    /* called before Zero terminate is sent */
+    int (*post_sendstate)(void *data);
+
     /* to be provided as the last argument to each callback function */
     void* data;
 };
diff --git a/tools/python/xen/lowlevel/checkpoint/checkpoint.c
b/tools/python/xen/lowlevel/checkpoint/checkpoint.c
index 7545d7d..f880f1b 100644
--- a/tools/python/xen/lowlevel/checkpoint/checkpoint.c
+++ b/tools/python/xen/lowlevel/checkpoint/checkpoint.c
@@ -1,14 +1,22 @@
 /* python bridge to checkpointing API */
 
 #include <Python.h>
+#include <sys/wait.h>
 
 #include <xs.h>
 #include <xenctrl.h>
+#include <xc_private.h>
+#include <xg_save_restore.h>
 
 #include "checkpoint.h"
 
 #define PKG "xen.lowlevel.checkpoint"
 
+#define COMP_IOC_MAGIC    ''k''
+#define COMP_IOCTWAIT     _IO(COMP_IOC_MAGIC, 0)
+#define COMP_IOCTFLUSH    _IO(COMP_IOC_MAGIC, 1)
+#define COMP_IOCTRESUME   _IO(COMP_IOC_MAGIC, 2)
+
 static PyObject* CheckpointError;
 
 typedef struct {
@@ -24,11 +32,15 @@ typedef struct {
   PyObject* checkpoint_cb;
 
   PyThreadState* threadstate;
+  int colo;
+  int first_time;
+  int dev_fd;
 } CheckpointObject;
 
 static int suspend_trampoline(void* data);
 static int postcopy_trampoline(void* data);
 static int checkpoint_trampoline(void* data);
+static int post_sendstate_trampoline(void *data);
 
 static PyObject* Checkpoint_new(PyTypeObject* type, PyObject* args,
                                PyObject* kwargs)
@@ -105,6 +117,7 @@ static PyObject* pycheckpoint_start(PyObject* obj, PyObject*
args) {
   int fd;
   struct save_callbacks callbacks;
   int rc;
+  int flags = 0;
 
   if (!PyArg_ParseTuple(args, "O|OOOI", &iofile, &suspend_cb,
&postcopy_cb,
                        &checkpoint_cb, &interval))
@@ -151,9 +164,16 @@ static PyObject* pycheckpoint_start(PyObject* obj,
PyObject* args) {
   } else
     self->checkpoint_cb = NULL;
 
+  if (flags & CHECKPOINT_FLAGS_COLO)
+    self->colo = 1;
+  else
+    self->colo = 0;
+  self->first_time = 1;
+
   callbacks.suspend = suspend_trampoline;
   callbacks.postcopy = postcopy_trampoline;
   callbacks.checkpoint = checkpoint_trampoline;
+  callbacks.post_sendstate = post_sendstate_trampoline;
   callbacks.data = self;
 
   self->threadstate = PyEval_SaveThread();
@@ -258,6 +278,192 @@ PyMODINIT_FUNC initcheckpoint(void) {
   block_timer();
 }
 
+/* colo functions */
+
+/* master                   slaver          comment
+ * "continue"   ===>
+ *              <===        "suspend"       guest is suspended
+ */
+static int notify_slaver_suspend(CheckpointObject *self)
+{
+    int fd = self->cps.fd;
+
+    return write_exact(fd, "continue", 8);
+}
+
+static int wait_slaver_suspend(CheckpointObject *self)
+{
+    int fd = self->cps.fd;
+    xc_interface *xch = self->cps.xch;
+    char buf[8];
+
+    if (self->first_time) {
+        self->first_time = 0;
+        return 0;
+    }
+
+    if ( read_exact(fd, buf, 7) < 0) {
+        PERROR("read: suspend");
+        return -1;
+    }
+
+    buf[7] = ''\0'';
+    if (strcmp(buf, "suspend")) {
+        PERROR("read \"%s\", expect \"suspend\"",
buf);
+        return -1;
+    }
+
+    return 0;
+}
+
+static int notify_slaver_start_checkpoint(CheckpointObject *self)
+{
+    int fd = self->cps.fd;
+    xc_interface *xch = self->cps.xch;
+
+    if ( write_exact(fd, "start", 8) < 0) {
+        PERROR("write start");
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * master                       slaver
+ *                  <====       "finish"
+ * flush packets
+ * "resume"         ====>
+ * resume vm                    resume vm
+ *                  <====       "resume"
+ */
+static int notify_slaver_resume(CheckpointObject *self)
+{
+    int fd = self->cps.fd;
+    xc_interface *xch = self->cps.xch;
+    char buf[7];
+
+    /* wait slaver to finish update memory, device state... */
+    if ( read_exact(fd, buf, 6) < 0) {
+        PERROR("read: finish");
+        return -1;
+    }
+
+    buf[6] = ''\0'';
+    if (strcmp(buf, "finish")) {
+        ERROR("read \"%s\", expect \"finish\"",
buf);
+        return -1;
+    }
+
+    if (!self->first_time)
+        /* flush queued packets now */
+        ioctl(self->dev_fd, COMP_IOCTFLUSH);
+
+    /* notify slaver to resume vm*/
+    if (write_exact(fd, "resume", 6)) {
+        PERROR("write: resume");
+        return -1;
+    }
+
+    return 0;
+}
+
+static int install_fw_network(CheckpointObject *self)
+{
+    pid_t pid;
+    xc_interface *xch = self->cps.xch;
+    int status;
+    int rc;
+
+    pid = vfork();
+    if (pid < 0) {
+        PERROR("vfork fails");
+        return -1;
+    }
+
+    if (pid > 0) {
+        rc = wait(&status);
+        if (rc != 0 || status != 0) {
+            ERROR("getting child status fails");
+            return -1;
+        }
+
+        return 0;
+    }
+
+    execl("/etc/xen/scripts/HA_fw_runtime.sh",
"HA_fw_runtime.sh", "install", NULL);
+    PERROR("execl fails");
+    return -1;
+}
+
+static int wait_slaver_resume(CheckpointObject *self)
+{
+    int fd = self->cps.fd;
+    xc_interface *xch = self->cps.xch;
+    char buf[7];
+
+    if (read_exact(fd, buf, 6) < 0) {
+        PERROR("read resume");
+        return -1;
+    }
+
+    buf[6] = ''\0'';
+    if (strcmp(buf, "resume")) {
+        ERROR("read \"%s\", expect \"resume\"",
buf);
+        return -1;
+    }
+
+    return 0;
+}
+
+static int colo_postresume(CheckpointObject *self)
+{
+    int rc;
+    int dev_fd = self->dev_fd;
+
+    rc = wait_slaver_resume(self);
+    if (rc < 0)
+        return rc;
+
+    if (self->first_time) {
+        rc = install_fw_network(self);
+        if (rc < 0)
+            return rc;
+    } else {
+        ioctl(dev_fd, COMP_IOCTRESUME);
+    }
+
+    return 0;
+}
+
+static int pre_checkpoint(CheckpointObject *self)
+{
+    xc_interface *xch = self->cps.xch;
+
+    if (!self->first_time)
+        return 0;
+
+    self->dev_fd = open("/dev/HA_compare", O_RDWR);
+    if (self->dev_fd < 0) {
+        PERROR("opening /dev/HA_compare fails");
+        return -1;
+    }
+
+    return 0;
+}
+
+static void wait_new_checkpoint(CheckpointObject *self)
+{
+    int dev_fd = self->dev_fd;
+    int err;
+
+    while (1) {
+        err = ioctl(dev_fd, COMP_IOCTWAIT);
+        if (err == 0 || err == -1)
+            break;
+    }
+}
+
 /* private functions */
 
 /* bounce C suspend call into python equivalent.
@@ -268,6 +474,13 @@ static int suspend_trampoline(void* data)
 
   PyObject* result;
 
+  if (self->colo) {
+    if (notify_slaver_suspend(self) < 0) {
+      fprintf(stderr, "nofitying slaver suspend fails\n");
+      return 0;
+    }
+  }
+
   /* call default suspend function, then python hook if available */
   if (self->armed) {
     if (checkpoint_wait(&self->cps) < 0) {
@@ -286,8 +499,16 @@ static int suspend_trampoline(void* data)
     }
   }
 
+  /* suspend_cb() should be called after both sides are suspended */
+  if (self->colo) {
+    if (wait_slaver_suspend(self) < 0) {
+      fprintf(stderr, "waiting slaver suspend fails\n");
+      return 0;
+    }
+  }
+
   if (!self->suspend_cb)
-    return 1;
+    goto start_checkpoint;
 
   PyEval_RestoreThread(self->threadstate);
   result = PyObject_CallFunction(self->suspend_cb, NULL);
@@ -298,12 +519,24 @@ static int suspend_trampoline(void* data)
 
   if (result == Py_None || PyObject_IsTrue(result)) {
     Py_DECREF(result);
-    return 1;
+    goto start_checkpoint;
   }
 
   Py_DECREF(result);
 
   return 0;
+
+start_checkpoint:
+  if (self->colo) {
+    if (notify_slaver_start_checkpoint(self) < 0) {
+      fprintf(stderr, "nofitying slaver to start checkpoint
fails\n");
+      return 0;
+    }
+
+    self->first_time = 0;
+  }
+
+  return 1;
 }
 
 static int postcopy_trampoline(void* data)
@@ -313,6 +546,13 @@ static int postcopy_trampoline(void* data)
   PyObject* result;
   int rc = 0;
 
+  if (self->colo) {
+    if (notify_slaver_resume(self) < 0) {
+      fprintf(stderr, "nofitying slaver resume fails\n");
+      return 0;
+    }
+  }
+
   if (!self->postcopy_cb)
     goto resume;
 
@@ -331,6 +571,13 @@ static int postcopy_trampoline(void* data)
     return 0;
   }
 
+  if (self->colo) {
+    if (colo_postresume(self) < 0) {
+      fprintf(stderr, "postresume fails\n");
+      return 0;
+    }
+  }
+
   return rc;
 }
 
@@ -345,8 +592,15 @@ static int checkpoint_trampoline(void* data)
       return -1;
   }
 
+  if (self->colo) {
+    if (pre_checkpoint(self) < 0) {
+      fprintf(stderr, "pre_checkpoint() fails\n");
+      return -1;
+    }
+  }
+
   if (!self->checkpoint_cb)
-    return 0;
+    goto wait_checkpoint;
 
   PyEval_RestoreThread(self->threadstate);
   result = PyObject_CallFunction(self->checkpoint_cb, NULL);
@@ -357,10 +611,37 @@ static int checkpoint_trampoline(void* data)
 
   if (result == Py_None || PyObject_IsTrue(result)) {
     Py_DECREF(result);
-    return 1;
+    goto wait_checkpoint;
   }
 
   Py_DECREF(result);
 
   return 0;
+
+wait_checkpoint:
+  if (self->colo) {
+    wait_new_checkpoint(self);
+  }
+
+  return 1;
+}
+
+static int post_sendstate_trampoline(void* data)
+{
+  CheckpointObject *self = data;
+  int fd = self->cps.fd;
+  int i = XC_SAVE_ID_LAST_CHECKPOINT;
+
+  if (!self->colo)
+    return 0;
+
+  /* In colo mode, guest is running on slaver side, so we should
+   * send XC_SAVE_ID_LAST_CHECKPOINT to slaver.
+   */
+  if (write_exact(fd, &i, sizeof(int)) < 0) {
+    fprintf(stderr, "writing XC_SAVE_ID_LAST_CHECKPOINT fails\n");
+    return -1;
+  }
+
+  return 0;
 }
diff --git a/tools/python/xen/lowlevel/checkpoint/checkpoint.h
b/tools/python/xen/lowlevel/checkpoint/checkpoint.h
index 36455fb..5dd6440 100644
--- a/tools/python/xen/lowlevel/checkpoint/checkpoint.h
+++ b/tools/python/xen/lowlevel/checkpoint/checkpoint.h
@@ -40,6 +40,8 @@ typedef struct {
     timer_t timer;
 } checkpoint_state;
 
+#define CHECKPOINT_FLAGS_COLO        2
+
 char* checkpoint_error(checkpoint_state* s);
 
 void checkpoint_init(checkpoint_state* s);
-- 
1.8.0

Wen Congyang

2013-Apr-03 08:02 UTC

head link

[RFC PATCH 6/7] XendCheckpoint: implement colo

In colo mode, XendCheckpoit.py will communicate with both master and
xc_restore. This patch implements this communication. In colo mode,
the signature is "GuestColoRestore".

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

---
 tools/python/xen/xend/XendCheckpoint.py | 138 +++++++++++++++++++++++---------
 1 file changed, 101 insertions(+), 37 deletions(-)

diff --git a/tools/python/xen/xend/XendCheckpoint.py
b/tools/python/xen/xend/XendCheckpoint.py
index fa09757..261d9d1 100644
--- a/tools/python/xen/xend/XendCheckpoint.py
+++ b/tools/python/xen/xend/XendCheckpoint.py
@@ -25,6 +25,7 @@ from xen.xend.XendConstants import *
 from xen.xend import XendNode
 
 SIGNATURE = "LinuxGuestRecord"
+COLO_SIGNATURE = "GuestColoRestore"
 QEMU_SIGNATURE = "QemuDeviceModelRecord"
 dm_batch = 512
 XC_SAVE = "xc_save"
@@ -203,10 +204,15 @@ def restore(xd, fd, dominfo = None, paused = False,
relocating = False):
 
     signature = read_exact(fd, len(SIGNATURE),
         "not a valid guest state file: signature read")
-    if signature != SIGNATURE:
+    if signature != SIGNATURE and signature != COLO_SIGNATURE:
         raise XendError("not a valid guest state file: found
''%s''" %
                         signature)
 
+    if signature == COLO_SIGNATURE:
+        colo = True
+    else
+        colo = False
+
     l = read_exact(fd, sizeof_int,
                    "not a valid guest state file: config size read")
     vmconfig_size = unpack("!i", l)[0]
@@ -305,6 +311,7 @@ def restore(xd, fd, dominfo = None, paused = False,
relocating = False):
         log.debug("[xc_restore]: %s", string.join(cmd))
 
         handler = RestoreInputHandler()
+        restore_handler = RestoreHandler(fd, colo, dominfo, inputHandler)
 
         forkHelper(cmd, fd, handler.handler, True)
 
@@ -321,35 +328,9 @@ def restore(xd, fd, dominfo = None, paused = False,
relocating = False):
             raise XendError(''Could not read store MFN'')
 
         if not is_hvm and handler.console_mfn is None:
-            raise XendError(''Could not read console MFN'')
-
-        restore_image.setCpuid()
-
-        # xc_restore will wait for source to close connection
-        
-        dominfo.completeRestore(handler.store_mfn, handler.console_mfn)
-
-        #
-        # We shouldn''t hold the domains_lock over a waitForDevices
-        # As this function sometime gets called holding this lock,
-        # we must release it and re-acquire it appropriately
-        #
-        from xen.xend import XendDomain
+            raise XendError(''Could not read console MFN'')
 
-        lock = True;
-        try:
-            XendDomain.instance().domains_lock.release()
-        except:
-            lock = False;
-
-        try:
-            dominfo.waitForDevices() # Wait for backends to set up
-        finally:
-            if lock:
-                XendDomain.instance().domains_lock.acquire()
-
-        if not paused:
-            dominfo.unpause()
+        restorehandler.resume(True, paused, None)
 
         return dominfo
     except Exception, exn:
@@ -358,23 +339,106 @@ def restore(xd, fd, dominfo = None, paused = False,
relocating = False):
         raise exn
 
 
+class RestoreHandler:
+    def __init__(self, fd, colo, dominfo, inputHandler):
+        self.fd = fd
+        self.colo = colo
+        self.firsttime = True
+        self.inputHandler = inputHandler
+        self.dominfo = dominfo
+
+    def resume(self, finish, paused, child):
+        fd = self.fd
+        dominfo = self.dominfo
+        handler = self.inputHandler
+        restore_image.setCpuid()
+        dominfo.completeRestore(handler.store_mfn, handler.console_mfn)
+
+        if self.colo and not finish:
+            # notify master that checkpoint finishes
+            write_exact(fd, "finish", "failed to write finish
done")
+            buf = read_exact(fd, 6, "failed to read resume flag")
+            if buf != "resume":
+                return False
+
+        from xen.xend import XendDomain
+
+        if self.firsttime:
+            lock = True;
+            try:
+                XendDomain.instance().domains_lock.release()
+            except:
+                lock = False;
+
+            try:
+                dominfo.waitForDevices() # Wait for backends to set up
+            finally:
+                if lock:
+                    XendDomain.instance().domains_lock.acquire()
+            if not paused:
+                dominfo.unpause()
+        else:
+            # colo
+            xc.domain_resume(dominfo.domid, 0)
+            ResumeDomain(dominfo.domid)
+
+        if self.colo and not finish:
+            child.tochild.write("resume\n")
+            child.tochild.flush()
+            buf = child.fromchild.readline()
+            if buf != "resume\n":
+                return False
+            if self.firsttime:
+                util.runcmd("/etc/xen/scripts/HA_fw_runtime.sh
slaver")
+            # notify master side VM resumed
+            write_exact(fd, "resume", "failed to write resume
done");
+
+            # wait new checkpoint
+            buf = read_exact(fd, 8, "failed to read continue flag")
+            if buf != "continue":
+                return False
+
+            child.tochild.write("suspend\n")
+            buf = child.fromchild.readline()
+            if buf != "suspend\n":
+                return False
+
+            # notify master side suspend done.
+            write_exact(fd, "suspend", "failed to write suspend
done")
+            buf = read_exact(fd, 5, "failed to read start flag")
+            if buf != "start":
+                return False
+
+            child.tochild.write("start\n")
+            child.tochild.flush()
+
+            self.firsttime = False
+
 class RestoreInputHandler:
-    def __init__(self):
+    def __init__(self, colo):
         self.store_mfn = None
         self.console_mfn = None
 
 
-    def handler(self, line, _):
+    def handler(self, line, child, restorehandler):
+        if line == "finish\n":
+            # colo
+            return restorehandler.resume(False, False, child)
+
         m = re.match(r"^(store-mfn) (\d+)$", line)
         if m:
             self.store_mfn = int(m.group(2))
-        else:
-            m = re.match(r"^(console-mfn) (\d+)$", line)
-            if m:
-                self.console_mfn = int(m.group(2))
+            return True
+
+        m = re.match(r"^(console-mfn) (\d+)$", line)
+        if m:
+            self.console_mfn = int(m.group(2))
+            return True
+
+        return False
 
 
-def forkHelper(cmd, fd, inputHandler, closeToChild):
+def forkHelper(cmd, fd, inputHandler, closeToChild, restorehandler):
     child = xPopen3(cmd, True, -1, [fd])
 
     if closeToChild:
@@ -392,7 +456,7 @@ def forkHelper(cmd, fd, inputHandler, closeToChild):
                 else:
                     line = line.rstrip()
                     log.debug(''%s'', line)
-                    inputHandler(line, child.tochild)
+                    inputHandler(line, child, restorehandler)
 
         except IOError, exn:
             raise XendError(''Error reading from child process for %s:
%s'' %
-- 
1.8.0

Wen Congyang

2013-Apr-03 08:02 UTC

head link

[RFC PATCH 7/7] remus: implement colo mode

Add a new option --colo to the command remus. We will ignore the
options: --time, -i, --no-net when --colo is specified. In colo
mode, we will write new signature "GuestColoRestore". If the xen-tool
in secondary machine does not support colo, it will reject this
signature, and the command remus will fail.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

---
 tools/python/xen/remus/image.py | 7 +++++--
 tools/python/xen/remus/save.py  | 6 ++++--
 tools/remus/remus               | 8 +++++++-
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/tools/python/xen/remus/image.py b/tools/python/xen/remus/image.py
index b79d1e5..0927314 100644
--- a/tools/python/xen/remus/image.py
+++ b/tools/python/xen/remus/image.py
@@ -189,9 +189,12 @@ def parseheader(header):
     "parses a header sexpression"
     return vm.parsedominfo(vm.strtosxpr(header))
 
-def makeheader(dominfo):
+def makeheader(dominfo, colo):
     "create an image header from a VM dominfo sxpr"
-    items = [SIGNATURE]
+    if colo:
+        items = [COLO_SIGNATURE]
+    else:
+        items = [SIGNATURE]
     sxpr = vm.sxprtostr(dominfo)
     items.append(struct.pack(''!i'', len(sxpr)))
     items.append(sxpr)
diff --git a/tools/python/xen/remus/save.py b/tools/python/xen/remus/save.py
index 71517da..5157153 100644
--- a/tools/python/xen/remus/save.py
+++ b/tools/python/xen/remus/save.py
@@ -127,7 +127,7 @@ class Keepalive(object):
 
 class Saver(object):
     def __init__(self, domid, fd, suspendcb=None, resumecb=None,
-                 checkpointcb=None, interval=0):
+                 checkpointcb=None, interval=0, colo):
         """Create a Saver object for taking guest checkpoints.
         domid:        name, number or UUID of a running domain
         fd:           a stream to which checkpoint data will be written.
@@ -135,12 +135,14 @@ class Saver(object):
         resumecb:     callback invoked before guest resumes
         checkpointcb: callback invoked when a checkpoint is complete. Return
                       True to take another checkpoint, or False to stop.
+        colo:         use colo mode
         """
         self.fd = fd
         self.suspendcb = suspendcb
         self.resumecb = resumecb
         self.checkpointcb = checkpointcb
         self.interval = interval
+        self.colo = colo
 
         self.vm = vm.VM(domid)
 
@@ -149,7 +151,7 @@ class Saver(object):
     def start(self):
         vm.getshadowmem(self.vm)
 
-        hdr = image.makeheader(self.vm.dominfo)
+        hdr = image.makeheader(self.vm.dominfo, self.colo)
         self.fd.write(hdr)
         self.fd.flush()
 
diff --git a/tools/remus/remus b/tools/remus/remus
index 11d83e4..34c200f 100644
--- a/tools/remus/remus
+++ b/tools/remus/remus
@@ -37,6 +37,8 @@ class Cfg(object):
                           help=''run without net buffering (benchmark
option)'')
         parser.add_option('''', ''--timer'',
dest=''timer'', action=''store_true'',
                           help=''force pause at checkpoint interval
(experimental)'')
+        parser.add_option('''', ''--colo'',
dest=''colo'', action=''store_true'',
+                          help=''use colo checkpointing
(experimental)'')
         self.parser = parser
 
     def usage(self):
@@ -53,6 +55,10 @@ class Cfg(object):
             self.netbuffer = False
         if opts.timer:
             self.timer = True
+        if opts.colo:
+            self.interval = 0
+            self.netbuffer = False
+            self.timer = True
 
         if not args:
             raise CfgException(''Missing domain'')
@@ -181,7 +187,7 @@ def run(cfg):
     rc = 0
 
     checkpointer = save.Saver(cfg.domid, fd, postsuspend, preresume, commit,
-                              interval)
+                              interval, cfg.colo)
 
     try:
         checkpointer.start()
-- 
1.8.0

George Dunlap

2013-Apr-03 11:44 UTC

head link

Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service

On Wed, Apr 3, 2013 at 9:02 AM, Wen Congyang <wency@cn.fujitsu.com>
wrote:> Virtual machine (VM) replication is a well known technique for providing
> application-agnostic software-implemented hardware fault tolerance -
> "non-stop service". Currently, remus provides this function, but
it buffers
> all output packets, and the latency is unacceptable.
Just FYI, as we''re in a feature freeze this can''t be accepted
until
the 4.3 release sometime in June; and since in the mean time we''ll be
trying to get other features sorted and bugs fixed, you may not get
much review time until then.

 -George

Tim Deegan

2013-Apr-04 09:26 UTC

head link

Re: [RFC PATCH 3/7] colo: implement restore_callbacks

Hi, 

At 16:02 +0800 on 03 Apr (1365004959), Wen Congyang
wrote:> +    /* reset memory */
> +    hypercall.op = __HYPERVISOR_reset_memory_op;
> +    hypercall.arg[0] = (unsigned long)dom;
> +    do_xen_hypercall(xch, &hypercall);
You''ve added a new hypercall here but I don''t see any
implementation (or
documentation).  Are there some xen-side patches missing?

Cheers,

Tim.
> @@ -93,6 +93,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
>  #define __HYPERVISOR_domctl               36
>  #define __HYPERVISOR_kexec_op             37
>  #define __HYPERVISOR_tmem_op              38
> +#define __HYPERVISOR_reset_memory_op      40
>  
>  /* Architecture-specific hypercall definitions. */
>  #define __HYPERVISOR_arch_0               48
> -- 
> 1.8.0
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

Shriram Rajagopalan

2013-Apr-05 03:55 UTC

head link

Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service

FYI, it would be nice to cc the maintainer when you submit such a major
functionality change,
especially for the xend code base that has so far been the only stable
workable remus solution.


On Wed, Apr 3, 2013 at 6:44 AM, George Dunlap
<George.Dunlap@eu.citrix.com>wrote:
> On Wed, Apr 3, 2013 at 9:02 AM, Wen Congyang <wency@cn.fujitsu.com>
wrote:
> > Virtual machine (VM) replication is a well known technique for
providing
> > application-agnostic software-implemented hardware fault tolerance -
> > "non-stop service". Currently, remus provides this function,
but it
> buffers
> > all output packets, and the latency is unacceptable.
>
> Just FYI, as we''re in a feature freeze this can''t be
accepted until
> the 4.3 release sometime in June; and since in the mean time we''ll
be
> trying to get other features sorted and bugs fixed, you may not get
> much review time until then.
>
>  -George
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Shriram Rajagopalan

2013-Apr-05 05:06 UTC

head link

Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service

On Wed, Apr 3, 2013 at 3:02 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
> Virtual machine (VM) replication is a well known technique for providing
> application-agnostic software-implemented hardware fault tolerance -
> "non-stop service". Currently, remus provides this function, but
it buffers
> all output packets, and the latency is unacceptable.
>
> In xen summit 2012, We introduce a new VM replication solution: colo
> (COarse-grain LOck-stepping virtual machine). The presentation is in
> the following URL:
>
>
http://www.slideshare.net/xen_com_mgr/colo-coarsegrain-lockstepping-virtual-machines-for-nonstop-service
>
> Here is the summary of the solution:
> From the client''s point of view, as long as the client observes
identical
> responses from the primary and secondary VMs, according to the service
> semantics, then the secondary VM(SVM) is a valid replica of the primary
> VM(PVM), and can successfully take over when a hardware failure of the
> PVM is detected.
>
> This patchset is RFC, and implements the frame of colo:
> 1. Both PVM and SVM are running
> 2. Forward the input packets from client to secondary machine(slaver)
> 3. Forward the output packets from SVM to primary machine(master)
> 4. Compare the output packets from PVM and SVM on the master side. If the
>    output packets are different, do a checkpoint
>
>
I skimmed through the presentation. Interesting approach. It would be nice
to have the performance report
that is mentioned in the slides, available, so that we can understand the
exact setups for the benchmarks.

A few quick thoughts after looking at the presentation:

 0. I am not completely sold on the type of applications you have used to
benchmark the system. They seem
     stateless and dont have much of memory churn (dirty pages/epoch).
     It would be nice to benchmark your system against something more
realistic, like the DVDStore benchmark
     or percona-tools'' TPCC benchmark with MySQL. [The clients have to
be
outside the system, mind you].
     And finally, something like Specweb2005, where there are about 1000
dirty pages per 25ms epoch.

     I care more about how many concurrent connections was handled by the
server and how frequently did you
     have to synchronize between the machines.

 1. The checkpoints are going to be very costly. If you are doing coarse
grained locking and assuming
     that checkpoint are triggered every second, you would probably lose
all benefits of checkpoint compression.
     Also your working set would have grown considerably large.  You will
inevitably end up taking the
     slow path, where you suspend the VM, "synchronously" send a ton
of
pages over the network
     (on the order of 10-100s of megabytes), and then resume the VM.
 Replicating this checkpoint is going to take
     a long time and will screw performance.

     The usual fast path has a small buffer (16/32 MB) to copy out the
dirty pages to the buffer, and asynchronously
     transmit it to the backup.


2. Whats the story for the DISK? The slides show that the VMs share a SAN
disk.
    And if both primary and secondary are operational, whose packets are
you going
    to discard, in a programmatic manner ?

    While you have a FTP server benchmark, it doesnt demonstrate output
consistency.
    I would suggest you run something like DVDStore (from Dell) or some
simple mysql TPCC
    and see if the clients raise a hue and cry about data corruption. ;)

3. What happens if one is running faster than the other ?
     Lets say the application does a bunch of read/write (dependent)
to/from the SAN.
     each write depends on the output of the previous read. And the writes
are non-deterministic
     (i.e. differ between primary and secondary).  Wont this system end up
in perpetual synchronization,
     since the outputs from primary and backup would be different, causing
a checkpoint again and again ?


And I would like to see atleast some *informal* guarantees of data
consistency - might sound academic but
when you are talking about putting critical customer applications like a
MySQL database, a SAP server or an
e-commerce web app, "consistency" matters! It helps to convince people
that
this system is not some half baked
experiment but something that is well thought of.

Once again, please CC me on the patches. Several files you have touched
belong to remus code
and the MAINTAINERS file has the maintainer info.

Nit: in one of your slides, you mentioned (75 ms/checkpoint, of which
2/3rds was spend in suspend/resume).
That isnt an artifact of Remus, FYI. I have run remus at 20ms checkpoint
interval, where VMs were suspended,
checkpointed and resumed in under 2ms.

With the addition of a ton of functionality -- both at the toolstack and in
the guest kernel -- the suspend resume times have
gone up considerably.  If you want to reduce that overhead, try a SuSe
based kernel that has suspend-event channel support.
You may not need any of those lazy netifs/netups etc.

Even with that, the new power management framework in the 3.* kernels seem
to have made suspend/resume pretty slow.


thanks
shriram

Changelog:>   Patch 1: optimize the dirty pages transfer speed.
>   Patch 2-3: allow SVM running after checkpoint
>   Patch 4-5: modification for colo on the master side(wait a new
> checkpoint,
>              communicate with slaver when doing checkoint)
>   Patch 6-7: implement colo''s user interface
>
> Wen Congyang (7):
>   xc_domain_save: cache pages mapping
>   xc_domain_restore: introduce restore_callbacks for colo
>   colo: implement restore_callbacks
>   xc_domain_save: flush cache before calling callbacks->postcopy()
>   xc_domain_save: implement save_callbacks for colo
>   XendCheckpoint: implement colo
>   remus: implement colo mode
>
>  tools/libxc/Makefile                              |   4 +-
>  tools/libxc/ia64/xc_ia64_linux_restore.c          |   3 +-
>  tools/libxc/xc_domain_restore.c                   | 256 +++++---
>  tools/libxc/xc_domain_restore_colo.c              | 740
> ++++++++++++++++++++++
>  tools/libxc/xc_domain_save.c                      | 162 +++--
>  tools/libxc/xc_save_restore_colo.h                |  44 ++
>  tools/libxc/xenguest.h                            |  57 +-
>  tools/libxl/libxl_dom.c                           |   2 +-
>  tools/python/xen/lowlevel/checkpoint/checkpoint.c | 289 ++++++++-
>  tools/python/xen/lowlevel/checkpoint/checkpoint.h |   2 +
>  tools/python/xen/remus/image.py                   |   7 +-
>  tools/python/xen/remus/save.py                    |   6 +-
>  tools/python/xen/xend/XendCheckpoint.py           | 138 ++--
>  tools/remus/remus                                 |   8 +-
>  tools/xcutils/xc_restore.c                        |   3 +-
>  xen/include/public/xen.h                          |   1 +
>  16 files changed, 1503 insertions(+), 219 deletions(-)
>  create mode 100644 tools/libxc/xc_domain_restore_colo.c
>  create mode 100644 tools/libxc/xc_save_restore_colo.h
>
> --
> 1.8.0
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Ian Campbell

2013-Apr-11 13:55 UTC

head link

Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service

On Fri, 2013-04-05 at 04:55 +0100, Shriram Rajagopalan wrote:
> the xend code base that has so far been the only stable workable remus
> solution.
What is the status of xl remus at the minute? Is it being actively
worked on?

Ian

Xen devel - Apr 2013 - [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service

[RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service

[RFC PATCH 1/7] xc_domain_save: cache pages mapping

[RFC PATCH 2/7] xc_domain_restore: introduce restore_callbacks for colo

[RFC PATCH 3/7] colo: implement restore_callbacks

[RFC PATCH 4/7] xc_domain_save: flush cache before calling callbacks->postcopy()

[RFC PATCH 5/7] xc_domain_save: implement save_callbacks for colo

[RFC PATCH 6/7] XendCheckpoint: implement colo

[RFC PATCH 7/7] remus: implement colo mode

Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service

Re: [RFC PATCH 3/7] colo: implement restore_callbacks

Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service

Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service

Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service