Wen Congyang
2013-Apr-03 08:02 UTC
[RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
Virtual machine (VM) replication is a well-known technique for providing
application-agnostic, software-implemented hardware fault tolerance -
"non-stop service". Currently, remus provides this function, but it buffers
all output packets, and the latency is unacceptable. At Xen Summit 2012, we
introduced a new VM replication solution: colo (COarse-grain LOck-stepping
virtual machine). The presentation is available at the following URL:
http://www.slideshare.net/xen_com_mgr/colo-coarsegrain-lockstepping-virtual-machines-for-nonstop-service

Here is a summary of the solution: from the client's point of view, as long
as the client observes identical responses from the primary and secondary
VMs, according to the service semantics, then the secondary VM (SVM) is a
valid replica of the primary VM (PVM), and can successfully take over when
a hardware failure of the PVM is detected.

This patchset is an RFC, and implements the framework of colo:
1. Both PVM and SVM are running
2. Forward the input packets from the client to the secondary machine
   (slaver)
3. Forward the output packets from the SVM to the primary machine (master)
4. Compare the output packets from PVM and SVM on the master side. If the
   output packets differ, do a checkpoint

Changelog:
Patch 1: optimize the dirty page transfer speed.
Patch 2-3: allow the SVM to run after a checkpoint
Patch 4-5: modifications for colo on the master side (wait for a new
           checkpoint, communicate with the slaver when doing a checkpoint)
Patch 6-7: implement colo's user interface

Wen Congyang (7):
  xc_domain_save: cache pages mapping
  xc_domain_restore: introduce restore_callbacks for colo
  colo: implement restore_callbacks
  xc_domain_save: flush cache before calling callbacks->postcopy()
  xc_domain_save: implement save_callbacks for colo
  XendCheckpoint: implement colo
  remus: implement colo mode

 tools/libxc/Makefile                              |   4 +-
 tools/libxc/ia64/xc_ia64_linux_restore.c          |   3 +-
 tools/libxc/xc_domain_restore.c                   | 256 +++++---
 tools/libxc/xc_domain_restore_colo.c              | 740 ++++++++++++++++++++++
 tools/libxc/xc_domain_save.c                      | 162 +++--
 tools/libxc/xc_save_restore_colo.h                |  44 ++
 tools/libxc/xenguest.h                            |  57 +-
 tools/libxl/libxl_dom.c                           |   2 +-
 tools/python/xen/lowlevel/checkpoint/checkpoint.c | 289 ++++++++-
 tools/python/xen/lowlevel/checkpoint/checkpoint.h |   2 +
 tools/python/xen/remus/image.py                   |   7 +-
 tools/python/xen/remus/save.py                    |   6 +-
 tools/python/xen/xend/XendCheckpoint.py           | 138 ++--
 tools/remus/remus                                 |   8 +-
 tools/xcutils/xc_restore.c                        |   3 +-
 xen/include/public/xen.h                          |   1 +
 16 files changed, 1503 insertions(+), 219 deletions(-)
 create mode 100644 tools/libxc/xc_domain_restore_colo.c
 create mode 100644 tools/libxc/xc_save_restore_colo.h

-- 
1.8.0
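[Step 4 above is the heart of colo. A minimal sketch of that per-packet
decision rule, not part of this series: compare_packet(), release_packet()
and do_checkpoint() are hypothetical helpers standing in for the real
comparison module that patches 6-7 drive through /dev/HA_compare.]

#include <stdbool.h>

struct packet;                                  /* opaque network packet */
bool compare_packet(const struct packet *pvm_pkt,
                    const struct packet *svm_pkt);
void release_packet(const struct packet *pkt);
void do_checkpoint(void);

static void colo_compare(const struct packet *pvm_pkt,
                         const struct packet *svm_pkt)
{
    if (compare_packet(pvm_pkt, svm_pkt)) {
        /* The client cannot tell the VMs apart: the SVM is still a
         * valid replica, so the output can go out immediately. */
        release_packet(pvm_pkt);
    } else {
        /* Outputs diverged: resynchronise the SVM with a checkpoint,
         * then release the buffered output. */
        do_checkpoint();
        release_packet(pvm_pkt);
    }
}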
Wen Congyang
2013-Apr-03 08:02 UTC
[RFC PATCH 1/7] xc_domain_save: cache pages mapping
We map the dirty pages, copy them to the secondary machine, and then unmap
them. xc_map_foreign_bulk() is too slow, so we cannot use the full
bandwidth to transfer the dirty pages. In our test, the transfer speed is
less than 300Mb/s. For virtual machine (VM) replication, the transfer
speed is very important, so we should cache the page mappings and map each
page only once. In our test, the transfer speed is about 2Gb/s with this
patch.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_domain_save.c | 113 +++++++++++++++++++++++++------------------
 1 file changed, 66 insertions(+), 47 deletions(-)

diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index fa270f5..222aa03 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -896,6 +896,50 @@ static int save_tsc_info(xc_interface *xch, uint32_t dom, int io_fd)
     return 0;
 }
 
+/* big cache to avoid future map */
+static char **pages_base;
+
+static int colo_ro_map_and_cache(xc_interface *xch, uint32_t dom,
+                                 unsigned long *pfn_batch, xen_pfn_t *pfn_type,
+                                 int *pfn_err, int batch)
+{
+    static xen_pfn_t cache_pfn_type[MAX_BATCH_SIZE];
+    static int cache_pfn_err[MAX_BATCH_SIZE];
+    int i, cache_batch = 0;
+    char *map;
+
+    for (i = 0; i < batch; i++)
+    {
+        if (!pages_base[pfn_batch[i]])
+            cache_pfn_type[cache_batch++] = pfn_type[i];
+    }
+
+    if (cache_batch)
+    {
+        map = xc_map_foreign_bulk(xch, dom, PROT_READ, cache_pfn_type,
+                                  cache_pfn_err, cache_batch);
+        if (!map)
+            return -1;
+    }
+
+    cache_batch = 0;
+    for (i = 0; i < batch; i++)
+    {
+        if (pages_base[pfn_batch[i]])
+        {
+            pfn_err[i] = 0;
+        }
+        else
+        {
+            if (!cache_pfn_err[cache_batch])
+                pages_base[pfn_batch[i]] = map + PAGE_SIZE * cache_batch;
+            pfn_err[i] = cache_pfn_err[cache_batch];
+            cache_batch++;
+        }
+    }
+
+    return 0;
+}
+
 int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
                    uint32_t max_factor, uint32_t flags,
                    struct save_callbacks* callbacks, int hvm)
@@ -927,9 +971,6 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
     /* Live mapping of shared info structure */
     shared_info_any_t *live_shinfo = NULL;
 
-    /* base of the region in which domain memory is mapped */
-    unsigned char *region_base = NULL;
-
     /* A copy of the CPU eXtended States of the guest. */
     DECLARE_HYPERCALL_BUFFER(void, buffer);
 
@@ -1111,6 +1152,14 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
     memset(pfn_type, 0,
            ROUNDUP(MAX_BATCH_SIZE * sizeof(*pfn_type), PAGE_SHIFT));
 
+    pages_base = calloc(dinfo->p2m_size, sizeof(*pages_base));
+    if (!pages_base)
+    {
+        ERROR("failed to alloc memory to cache page mapping");
+        errno = ENOMEM;
+        goto out;
+    }
+
     /* Setup the mfn_to_pfn table mapping */
     if ( !(ctx->live_m2p = xc_map_m2p(xch, ctx->max_mfn, PROT_READ, &ctx->m2p_mfn0)) )
     {
@@ -1308,9 +1357,8 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                 if ( batch == 0 )
                     goto skip; /* vanishingly unlikely... */
 
-                region_base = xc_map_foreign_bulk(
-                    xch, dom, PROT_READ, pfn_type, pfn_err, batch);
-                if ( region_base == NULL )
+                if (colo_ro_map_and_cache(xch, dom, pfn_batch, pfn_type, pfn_err,
+                                          batch) < 0)
                 {
                     PERROR("map batch failed");
                     goto out;
@@ -1356,7 +1404,7 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                             DPRINTF("%d pfn=%08lx sum=%08lx\n",
                                     iter,
                                     pfn_type[j],
-                                    csum_page(region_base + (PAGE_SIZE*j)));
+                                    csum_page(pages_base[pfn_batch[j]]));
                         else
                             DPRINTF("%d pfn= %08lx mfn= %08lx [mfn]= %08lx"
                                     " sum= %08lx\n",
@@ -1364,13 +1412,12 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                                     pfn_type[j],
                                     gmfn,
                                     mfn_to_pfn(gmfn),
-                                    csum_page(region_base + (PAGE_SIZE*j)));
+                                    csum_page(pages_base[pfn_batch[j]]));
                     }
                 }
 
                 if ( !run )
                 {
-                    munmap(region_base, batch*PAGE_SIZE);
                     continue; /* bail on this batch: no valid pages */
                 }
@@ -1393,33 +1440,14 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                     pfn_type[j] = ((unsigned long *)pfn_type)[j];
 
             /* entering this loop, pfn_type is now in pfns (Not mfns) */
-            run = 0;
             for ( j = 0; j < batch; j++ )
             {
                 unsigned long pfn, pagetype;
-                void *spage = (char *)region_base + (PAGE_SIZE*j);
+                void *spage = pages_base[pfn_batch[j]];
 
                 pfn = pfn_type[j] & ~XEN_DOMCTL_PFINFO_LTAB_MASK;
                 pagetype = pfn_type[j] & XEN_DOMCTL_PFINFO_LTAB_MASK;
 
-                if ( pagetype != 0 )
-                {
-                    /* If the page is not a normal data page, write out any
-                       run of pages we may have previously acumulated */
-                    if ( run )
-                    {
-                        if ( ratewrite(io_fd, live,
-                                       (char*)region_base+(PAGE_SIZE*(j-run)),
-                                       PAGE_SIZE*run) != PAGE_SIZE*run )
-                        {
-                            PERROR("Error when writing to state file (4a)"
-                                   " (errno %d)", errno);
-                            goto out;
-                        }
-                        run = 0;
-                    }
-                }
-
                 /* skip pages that aren't present */
                 if ( pagetype == XEN_DOMCTL_PFINFO_XTAB )
                     continue;
@@ -1449,28 +1477,19 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                 }
                 else
                 {
-                    /* We have a normal page: accumulate it for writing. */
-                    run++;
+                    /* Stop accumulating writes temporarily; we will add
+                     * them back via writev() when needed.
+                     */
+                    if (ratewrite(io_fd, live, spage, PAGE_SIZE) != PAGE_SIZE)
+                    {
+                        PERROR("Error when writing to state file (4c)"
+                               " (errno %d)", errno);
+                        goto out;
+                    }
                 }
             } /* end of the write out for this batch */
 
-            if ( run )
-            {
-                /* write out the last accumulated run of pages */
-                if ( ratewrite(io_fd, live,
-                               (char*)region_base+(PAGE_SIZE*(j-run)),
-                               PAGE_SIZE*run) != PAGE_SIZE*run )
-                {
-                    PERROR("Error when writing to state file (4c)"
-                           " (errno %d)", errno);
-                    goto out;
-                }
-            }
-
             sent_this_iter += batch;
-
-            munmap(region_base, batch*PAGE_SIZE);
-
         } /* end of this while loop for this iteration */
 
   skip:
-- 
1.8.0
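[The optimisation above boils down to "map once, reuse forever". A
condensed sketch of the idea behind colo_ro_map_and_cache(); map_one_page()
is a hypothetical helper standing in for a single-entry
xc_map_foreign_bulk() call, and pages_base[] is indexed by PFN exactly as
in the patch.]

static char *map_one_page(unsigned long pfn);   /* hypothetical helper */
static char **pages_base;   /* pfn -> cached read-only mapping, or NULL */

static char *get_cached_page(unsigned long pfn)
{
    if (!pages_base[pfn])
        pages_base[pfn] = map_one_page(pfn);    /* slow path: map once */
    return pages_base[pfn];                     /* fast path: reuse it */
}

[The trade-off is address space: every guest page that is ever sent stays
mapped for the lifetime of the saving process. That is acceptable for
colo, where the save loop keeps sending checkpoints for as long as the
domain runs.]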
Wen Congyang
2013-Apr-03 08:02 UTC
[RFC PATCH 2/7] xc_domain_restore: introduce restore_callbacks for colo
In colo mode, the SVM also runs, so we should update xc_restore to support
it. The first step is to add some callbacks for colo:

1. init(): initialize the private data used for colo

2. free(): free the resources we allocate and store in the private data

3. get_page(): the SVM is running, so we cannot update the memory in
   apply_batch(). This callback returns a page buffer, and apply_batch()
   copies the page into that buffer. The buffer should hold the current
   content of the page, so we can use it for verification.

4. flush_memory(): update the SVM memory and pagetables.

5. update_p2m(): update the SVM p2m pages.

6. finish_restore(): wait for a new checkpoint.

We also add a new structure, restore_data, to avoid passing too many
arguments to these callbacks. This structure stores the variables used in
xc_domain_restore() that the callbacks need.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/ia64/xc_ia64_linux_restore.c |   3 +-
 tools/libxc/xc_domain_restore.c          | 256 +++++++++++++++++++++----------
 tools/libxc/xenguest.h                   |  54 ++++++-
 tools/libxl/libxl_dom.c                  |   2 +-
 tools/xcutils/xc_restore.c               |   3 +-
 5 files changed, 230 insertions(+), 88 deletions(-)

diff --git a/tools/libxc/ia64/xc_ia64_linux_restore.c b/tools/libxc/ia64/xc_ia64_linux_restore.c
index b4e9e9c..ca76be6 100644
--- a/tools/libxc/ia64/xc_ia64_linux_restore.c
+++ b/tools/libxc/ia64/xc_ia64_linux_restore.c
@@ -550,7 +550,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       unsigned int store_evtchn, unsigned long *store_mfn,
                       unsigned int console_evtchn, unsigned long *console_mfn,
-                      unsigned int hvm, unsigned int pae, int superpages)
+                      unsigned int hvm, unsigned int pae, int superpages,
+                      struct restore_callbacks *callbacks)
 {
     DECLARE_DOMCTL;
     int rc = 1;
diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index 43e6c52..fa828e9 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -882,13 +882,15 @@ static int pagebuf_get(xc_interface *xch, struct restore_ctx *ctx,
 static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
                        xen_pfn_t* region_mfn, unsigned long* pfn_type, int pae_extended_cr3,
                        unsigned int hvm, struct xc_mmu* mmu,
-                       pagebuf_t* pagebuf, int curbatch)
+                       pagebuf_t* pagebuf, int curbatch,
+                       struct restore_callbacks *callbacks)
 {
     int i, j, curpage, nr_mfns;
     /* used by debug verify code */
     unsigned long buf[PAGE_SIZE/sizeof(unsigned long)];
     /* Our mapping of the current region (batch) */
     char *region_base;
+    char *target_buf;
     /* A temporary mapping, and a copy, of one frame of guest memory. */
     unsigned long *page = NULL;
     int nraces = 0;
@@ -954,16 +956,19 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
         }
     }
 
-    /* Map relevant mfns */
-    pfn_err = calloc(j, sizeof(*pfn_err));
-    region_base = xc_map_foreign_bulk(
-        xch, dom, PROT_WRITE, region_mfn, pfn_err, j);
-
-    if ( region_base == NULL )
+    if ( !callbacks || !callbacks->get_page )
     {
-        PERROR("map batch failed");
-        free(pfn_err);
-        return -1;
+        /* Map relevant mfns */
+        pfn_err = calloc(j, sizeof(*pfn_err));
+        region_base = xc_map_foreign_bulk(
+            xch, dom, PROT_WRITE, region_mfn, pfn_err, j);
+
+        if ( region_base == NULL )
+        {
+            PERROR("map batch failed");
+            free(pfn_err);
+            return -1;
+        }
     }
 
     for ( i = 0, curpage = -1; i < j; i++ )
@@ -975,7 +980,7 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
             /* a bogus/unmapped page: skip it */
             continue;
 
-        if (pfn_err[i])
+        if ( (!callbacks || !callbacks->get_page) && pfn_err[i] )
         {
             ERROR("unexpected PFN mapping failure");
             goto err_mapped;
@@ -993,8 +998,20 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
 
         mfn = ctx->p2m[pfn];
 
+        if ( callbacks && callbacks->get_page )
+        {
+            target_buf = callbacks->get_page(&callbacks->comm_data,
+                                             callbacks->data, pfn);
+            if ( !target_buf )
+            {
+                ERROR("Cannot get a buffer to store memory");
+                goto err_mapped;
+            }
+        }
+        else
+            target_buf = region_base + i*PAGE_SIZE;
+
         /* In verify mode, we use a copy; otherwise we work in place */
-        page = pagebuf->verify ? (void *)buf : (region_base + i*PAGE_SIZE);
+        page = pagebuf->verify ? (void *)buf : target_buf;
 
         memcpy(page, pagebuf->pages + (curpage + curbatch) * PAGE_SIZE,
                PAGE_SIZE);
@@ -1038,27 +1055,26 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
         if ( pagebuf->verify )
         {
-            int res = memcmp(buf, (region_base + i*PAGE_SIZE), PAGE_SIZE);
+            int res = memcmp(buf, target_buf, PAGE_SIZE);
             if ( res )
             {
                 int v;
 
                 DPRINTF("************** pfn=%lx type=%lx gotcs=%08lx "
                         "actualcs=%08lx\n", pfn, pagebuf->pfn_types[pfn],
-                        csum_page(region_base + (i + curbatch)*PAGE_SIZE),
+                        csum_page(target_buf),
                         csum_page(buf));
 
                 for ( v = 0; v < 4; v++ )
                 {
-                    unsigned long *p = (unsigned long *)
-                        (region_base + i*PAGE_SIZE);
+                    unsigned long *p = (unsigned long *)target_buf;
                     if ( buf[v] != p[v] )
                         DPRINTF("    %d: %08lx %08lx\n", v, buf[v], p[v]);
                 }
             }
         }
 
-        if ( !hvm &&
+        if ( (!callbacks || !callbacks->get_page) && !hvm &&
              xc_add_mmu_update(xch, mmu,
                                (((unsigned long long)mfn) << PAGE_SHIFT)
                                | MMU_MACHPHYS_UPDATE, pfn) )
@@ -1071,8 +1087,11 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
     rc = nraces;
 
   err_mapped:
-    munmap(region_base, j*PAGE_SIZE);
-    free(pfn_err);
+    if ( !callbacks || !callbacks->get_page )
+    {
+        munmap(region_base, j*PAGE_SIZE);
+        free(pfn_err);
+    }
 
     return rc;
 }
@@ -1080,7 +1099,8 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
 int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       unsigned int store_evtchn, unsigned long *store_mfn,
                       unsigned int console_evtchn, unsigned long *console_mfn,
-                      unsigned int hvm, unsigned int pae, int superpages)
+                      unsigned int hvm, unsigned int pae, int superpages,
+                      struct restore_callbacks *callbacks)
 {
     DECLARE_DOMCTL;
     int rc = 1, frc, i, j, n, m, pae_extended_cr3 = 0, ext_vcpucontext = 0;
@@ -1141,6 +1161,9 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     static struct restore_ctx *ctx = &_ctx;
     struct domain_info_context *dinfo = &ctx->dinfo;
 
+    struct restore_data *comm_data = NULL;
+    void *data = NULL;
+
     pagebuf_init(&pagebuf);
     memset(&tailbuf, 0, sizeof(tailbuf));
     tailbuf.ishvm = hvm;
@@ -1249,6 +1272,32 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
         goto out;
     }
 
+    /* init callbacks->comm_data */
+    if ( callbacks )
+    {
+        callbacks->comm_data.xch = xch;
+        callbacks->comm_data.dom = dom;
+        callbacks->comm_data.dinfo = dinfo;
+        callbacks->comm_data.hvm = hvm;
+        callbacks->comm_data.pfn_type = pfn_type;
+        callbacks->comm_data.mmu = mmu;
+        callbacks->comm_data.p2m_frame_list = p2m_frame_list;
+        callbacks->comm_data.p2m = ctx->p2m;
+        comm_data = &callbacks->comm_data;
+
+        /* init callbacks->data */
+        if ( callbacks->init )
+        {
+            callbacks->data = NULL;
+            if ( callbacks->init(&callbacks->comm_data, &callbacks->data) < 0 )
+            {
+                ERROR("Could not initialise restore callbacks private data");
+                goto out;
+            }
+        }
+        data = callbacks->data;
+    }
+
     xc_report_progress_start(xch, "Reloading memory pages", dinfo->p2m_size);
 
     /*
@@ -1298,7 +1347,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
             int brc;
 
             brc = apply_batch(xch, dom, ctx, region_mfn, pfn_type,
-                              pae_extended_cr3, hvm, mmu, &pagebuf, curbatch);
+                              pae_extended_cr3, hvm, mmu, &pagebuf, curbatch,
+                              callbacks);
             if ( brc < 0 )
                 goto out;
 
@@ -1368,6 +1418,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
         goto finish;
     }
 
+getpages:
     // DPRINTF("Buffered checkpoint\n");
 
     if ( pagebuf_get(xch, ctx, &pagebuf, io_fd, dom) )
     {
@@ -1499,58 +1550,69 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
         }
     }
 
-    /*
-     * Pin page tables. Do this after writing to them as otherwise Xen
-     * will barf when doing the type-checking.
-     */
-    nr_pins = 0;
-    for ( i = 0; i < dinfo->p2m_size; i++ )
+    if ( callbacks && callbacks->flush_memory )
     {
-        if ( (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
-            continue;
-
-        switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        if ( callbacks->flush_memory(comm_data, data) < 0 )
         {
-        case XEN_DOMCTL_PFINFO_L1TAB:
-            pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
-            break;
+            ERROR("Error doing callbacks->flush_memory()");
+            goto out;
+        }
+    }
+    else
+    {
+        /*
+         * Pin page tables. Do this after writing to them as otherwise Xen
+         * will barf when doing the type-checking.
+         */
+        nr_pins = 0;
+        for ( i = 0; i < dinfo->p2m_size; i++ )
+        {
+            if ( (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+                continue;
 
-        case XEN_DOMCTL_PFINFO_L2TAB:
-            pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE;
-            break;
+            switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+            {
+            case XEN_DOMCTL_PFINFO_L1TAB:
+                pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
+                break;
 
-        case XEN_DOMCTL_PFINFO_L3TAB:
-            pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE;
-            break;
+            case XEN_DOMCTL_PFINFO_L2TAB:
+                pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE;
+                break;
 
-        case XEN_DOMCTL_PFINFO_L4TAB:
-            pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE;
-            break;
+            case XEN_DOMCTL_PFINFO_L3TAB:
+                pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE;
+                break;
 
-        default:
-            continue;
-        }
+            case XEN_DOMCTL_PFINFO_L4TAB:
+                pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE;
+                break;
+
+            default:
+                continue;
+            }
 
-        pin[nr_pins].arg1.mfn = ctx->p2m[i];
-        nr_pins++;
+            pin[nr_pins].arg1.mfn = ctx->p2m[i];
+            nr_pins++;
 
-        /* Batch full? Then flush. */
-        if ( nr_pins == MAX_PIN_BATCH )
-        {
-            if ( xc_mmuext_op(xch, pin, nr_pins, dom) < 0 )
+            /* Batch full? Then flush. */
+            if ( nr_pins == MAX_PIN_BATCH )
             {
-                PERROR("Failed to pin batch of %d page tables", nr_pins);
-                goto out;
+                if ( xc_mmuext_op(xch, pin, nr_pins, dom) < 0 )
+                {
+                    PERROR("Failed to pin batch of %d page tables", nr_pins);
+                    goto out;
+                }
+                nr_pins = 0;
             }
-            nr_pins = 0;
         }
-    }
 
-    /* Flush final partial batch. */
-    if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0) )
-    {
-        PERROR("Failed to pin batch of %d page tables", nr_pins);
-        goto out;
+        /* Flush final partial batch. */
+        if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0) )
+        {
+            PERROR("Failed to pin batch of %d page tables", nr_pins);
+            goto out;
+        }
     }
 
     DPRINTF("Memory reloaded (%ld pages)\n", ctx->nr_pfns);
@@ -1767,37 +1829,61 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     /* leave wallclock time. set by hypervisor */
     munmap(new_shared_info, PAGE_SIZE);
 
-    /* Uncanonicalise the pfn-to-mfn table frame-number list. */
-    for ( i = 0; i < P2M_FL_ENTRIES; i++ )
+    if ( callbacks && callbacks->update_p2m )
     {
-        pfn = p2m_frame_list[i];
-        if ( (pfn >= dinfo->p2m_size) || (pfn_type[pfn] != XEN_DOMCTL_PFINFO_NOTAB) )
+        if ( callbacks->update_p2m(comm_data, data) < 0 )
         {
-            ERROR("PFN-to-MFN frame number %i (%#lx) is bad", i, pfn);
+            ERROR("Error doing callbacks->update_p2m()");
             goto out;
         }
-        p2m_frame_list[i] = ctx->p2m[pfn];
     }
-
-    /* Copy the P2M we've constructed to the 'live' P2M */
-    if ( !(ctx->live_p2m = xc_map_foreign_pages(xch, dom, PROT_WRITE,
-                                                p2m_frame_list, P2M_FL_ENTRIES)) )
+    else
     {
-        PERROR("Couldn't map p2m table");
-        goto out;
+        /* Uncanonicalise the pfn-to-mfn table frame-number list. */
+        for ( i = 0; i < P2M_FL_ENTRIES; i++ )
+        {
+            pfn = p2m_frame_list[i];
+            if ( (pfn >= dinfo->p2m_size) || (pfn_type[pfn] != XEN_DOMCTL_PFINFO_NOTAB) )
+            {
+                ERROR("PFN-to-MFN frame number %i (%#lx) is bad", i, pfn);
+                goto out;
+            }
+            p2m_frame_list[i] = ctx->p2m[pfn];
+        }
+
+        /* Copy the P2M we've constructed to the 'live' P2M */
+        if ( !(ctx->live_p2m = xc_map_foreign_pages(xch, dom, PROT_WRITE,
+                                                    p2m_frame_list, P2M_FL_ENTRIES)) )
+        {
+            PERROR("Couldn't map p2m table");
+            goto out;
+        }
+
+        /* If the domain we're restoring has a different word size to ours,
+         * we need to adjust the live_p2m assignment appropriately */
+        if ( dinfo->guest_width > sizeof (xen_pfn_t) )
+            for ( i = dinfo->p2m_size - 1; i >= 0; i-- )
+                ((int64_t *)ctx->live_p2m)[i] = (long)ctx->p2m[i];
+        else if ( dinfo->guest_width < sizeof (xen_pfn_t) )
+            for ( i = 0; i < dinfo->p2m_size; i++ )
+                ((uint32_t *)ctx->live_p2m)[i] = ctx->p2m[i];
+        else
+            memcpy(ctx->live_p2m, ctx->p2m, dinfo->p2m_size * sizeof(xen_pfn_t));
+        munmap(ctx->live_p2m, P2M_FL_ENTRIES * PAGE_SIZE);
     }
 
-    /* If the domain we're restoring has a different word size to ours,
-     * we need to adjust the live_p2m assignment appropriately */
-    if ( dinfo->guest_width > sizeof (xen_pfn_t) )
-        for ( i = dinfo->p2m_size - 1; i >= 0; i-- )
-            ((int64_t *)ctx->live_p2m)[i] = (long)ctx->p2m[i];
-    else if ( dinfo->guest_width < sizeof (xen_pfn_t) )
-        for ( i = 0; i < dinfo->p2m_size; i++ )
-            ((uint32_t *)ctx->live_p2m)[i] = ctx->p2m[i];
-    else
-        memcpy(ctx->live_p2m, ctx->p2m, dinfo->p2m_size * sizeof(xen_pfn_t));
-    munmap(ctx->live_p2m, P2M_FL_ENTRIES * PAGE_SIZE);
+    if ( callbacks && callbacks->finish_restore )
+    {
+        rc = callbacks->finish_restore(comm_data, data);
+        if ( rc == 1 )
+            goto getpages;
+
+        if ( rc < 0 )
+        {
+            ERROR("Error doing callbacks->finish_restore()");
+            goto out;
+        }
+    }
 
     DPRINTF("Domain ready to be built.\n");
     rc = 0;
@@ -1861,6 +1947,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     rc = 0;
 
  out:
+    if ( callbacks && callbacks->free && callbacks->data )
+        callbacks->free(&callbacks->comm_data, callbacks->data);
     if ( (rc != 0) && (dom != 0) )
         xc_domain_destroy(xch, dom);
     xc_hypercall_buffer_free(xch, ctxt);
diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
index 9ed0ea4..709a284 100644
--- a/tools/libxc/xenguest.h
+++ b/tools/libxc/xenguest.h
@@ -60,6 +60,57 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                    struct save_callbacks* callbacks, int hvm);
 
+/* Pass the variables defined in xc_domain_restore() to the callbacks. Use
+ * this structure for the following purposes:
+ * 1. avoid too many arguments.
+ * 2. different callback implementations may need different arguments.
+ *    Just add the information you need here.
+ */
+struct restore_data
+{
+    xc_interface *xch;
+    uint32_t dom;
+    struct domain_info_context *dinfo;
+    int hvm;
+    unsigned long *pfn_type;
+    struct xc_mmu *mmu;
+    unsigned long *p2m_frame_list;
+    unsigned long *p2m;
+};
+
+/* callbacks provided by xc_domain_restore */
+struct restore_callbacks {
+    /* callback to init data */
+    int (*init)(struct restore_data *comm_data, void **data);
+    /* callback to free data */
+    void (*free)(struct restore_data *comm_data, void *data);
+    /* callback to get a buffer to store memory data that is transferred
+     * from the source machine.
+     */
+    char *(*get_page)(struct restore_data *comm_data, void *data,
+                      unsigned long pfn);
+    /* callback to flush memory that is transferred from the source machine
+     * to the guest. Update the guest's pagetable if necessary.
+     */
+    int (*flush_memory)(struct restore_data *comm_data, void *data);
+    /* callback to update the guest's p2m table */
+    int (*update_p2m)(struct restore_data *comm_data, void *data);
+    /* callback to finish the restore process. It is called before
+     * xc_domain_restore() returns.
+     *
+     * Return value:
+     *   -1: error
+     *    0: continue to start vm
+     *    1: continue to do a checkpoint
+     */
+    int (*finish_restore)(struct restore_data *comm_data, void *data);
+
+    /* xc_domain_restore() inits it */
+    struct restore_data comm_data;
+    /* to be provided as the last argument to each callback function */
+    void* data;
+};
+
 /**
  * This function will restore a saved domain.
  *
@@ -76,7 +127,8 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
 int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
                       unsigned int store_evtchn, unsigned long *store_mfn,
                       unsigned int console_evtchn, unsigned long *console_mfn,
-                      unsigned int hvm, unsigned int pae, int superpages);
+                      unsigned int hvm, unsigned int pae, int superpages,
+                      struct restore_callbacks *callbacks);
 /**
  * xc_domain_restore writes a file to disk that contains the device
  * model saved state.
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index c702cf7..32cdd03 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -305,7 +305,7 @@ int libxl__domain_restore_common(libxl_ctx *ctx, uint32_t domid,
     rc = xc_domain_restore(ctx->xch, fd, domid,
                            state->store_port, &state->store_mfn,
                            state->console_port, &state->console_mfn,
-                           info->hvm, info->u.hvm.pae, 0);
+                           info->hvm, info->u.hvm.pae, 0, NULL);
     if ( rc ) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "restoring domain");
         return ERROR_FAIL;
diff --git a/tools/xcutils/xc_restore.c b/tools/xcutils/xc_restore.c
index ea069ac..8af88e4 100644
--- a/tools/xcutils/xc_restore.c
+++ b/tools/xcutils/xc_restore.c
@@ -46,7 +46,8 @@ main(int argc, char **argv)
         superpages = 0;
 
     ret = xc_domain_restore(xch, io_fd, domid, store_evtchn, &store_mfn,
-                            console_evtchn, &console_mfn, hvm, pae, superpages);
+                            console_evtchn, &console_mfn, hvm, pae, superpages,
+                            NULL);
 
     if ( ret == 0 )
     {
-- 
1.8.0
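[To see how the pieces fit together, here is a sketch - not part of this
patch - of how a colo-aware caller could wire up struct restore_callbacks
with the implementations that patch 3 introduces (declared in its
xc_save_restore_colo.h). A conventional restore keeps passing NULL, as
libxl and xc_restore do above.]

/* assumes xenguest.h and xc_save_restore_colo.h are included */
static int restore_with_colo(xc_interface *xch, int io_fd, uint32_t dom,
                             unsigned int store_evtchn, unsigned long *store_mfn,
                             unsigned int console_evtchn, unsigned long *console_mfn,
                             unsigned int hvm, unsigned int pae)
{
    struct restore_callbacks cbs = {
        .init           = restore_colo_init,
        .free           = restore_colo_free,
        .get_page       = get_page,
        .flush_memory   = flush_memory,
        .update_p2m     = update_p2m_table,
        .finish_restore = finish_colo,
    };

    /* cbs.comm_data and cbs.data are filled in by xc_domain_restore() */
    return xc_domain_restore(xch, io_fd, dom, store_evtchn, store_mfn,
                             console_evtchn, console_mfn, hvm, pae,
                             0 /* superpages */, &cbs);
}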
Wen Congyang
2013-Apr-03 08:02 UTC
[RFC PATCH 3/7] colo: implement restore_callbacks
This patch implements the restore callbacks for colo:

1. init(): allocate some memory

2. free(): free the memory allocated in init()

3. get_page(): we have cached the whole memory, so just return the buffer.
   The page is also marked as dirty.

4. flush_memory(): we update the memory as follows:
   a. pin non-dirty L1 pagetables
   b. unpin pagetables except non-dirty L1
   c. update the memory
   d. pin page tables
   e. unpin non-dirty L1 pagetables

5. update_p2m(): just update the dirty pages which store the p2m.

6. finish_restore(): we run xc_restore from XendCheckpoint.py, and
   communicate with XendCheckpoint.py like this:
   a. write "finish\n" to stdout when we are ready to resume the vm
   b. XendCheckpoint.py writes "resume\n" when the vm is resumed
   c. write "resume\n" to stdout when postresume is done
   d. XendCheckpoint.py writes "suspend\n" when a new checkpoint begins
   e. write "suspend\n" to stdout when the vm is suspended
   f. XendCheckpoint.py writes "start\n" when the primary begins to
      transfer dirty pages

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/Makefile                 |   4 +-
 tools/libxc/xc_domain_restore_colo.c | 740 +++++++++++++++++++++++++++++++++++
 tools/libxc/xc_domain_save.c         |  34 +-
 tools/libxc/xc_save_restore_colo.h   |  44 +++
 xen/include/public/xen.h             |   1 +
 5 files changed, 788 insertions(+), 35 deletions(-)
 create mode 100644 tools/libxc/xc_domain_restore_colo.c
 create mode 100644 tools/libxc/xc_save_restore_colo.h

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 5a7677e..e2d059d 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -42,12 +42,12 @@ CTRL_SRCS-$(CONFIG_MiniOS) += xc_minios.c
 GUEST_SRCS-y :=
 GUEST_SRCS-y += xg_private.c xc_suspend.c
-GUEST_SRCS-$(CONFIG_MIGRATE) += xc_domain_restore.c xc_domain_save.c
+GUEST_SRCS-$(CONFIG_MIGRATE) += xc_domain_restore.c xc_domain_save.c xc_domain_restore_colo.c
 GUEST_SRCS-$(CONFIG_MIGRATE) += xc_offline_page.c
 GUEST_SRCS-$(CONFIG_HVM) += xc_hvm_build.c
 
 vpath %.c ../../xen/common/libelf
-CFLAGS += -I../../xen/common/libelf
+CFLAGS += -I../../xen/common/libelf -I../xenstore
 
 GUEST_SRCS-y += libelf-tools.c libelf-loader.c
 GUEST_SRCS-y += libelf-dominfo.c libelf-relocate.c
diff --git a/tools/libxc/xc_domain_restore_colo.c b/tools/libxc/xc_domain_restore_colo.c
new file mode 100644
index 0000000..ffc7daa
--- /dev/null
+++ b/tools/libxc/xc_domain_restore_colo.c
@@ -0,0 +1,740 @@
+#include <xc_save_restore_colo.h>
+#include <xs.h>
+
+struct restore_colo_data
+{
+    /* store the pfn type on the slaver side */
+    unsigned long *pfn_type_slaver;
+
+    unsigned long max_mem_pfn;
+
+    /* cache the whole memory */
+    char *pagebase;
+
+    /* which pages are dirty? */
+    unsigned long *dirty_pages;
+
+    /* suspend evtchn */
+    int local_port;
+
+    xc_evtchn *xce;
+
+    /* temp buffers (avoid frequent malloc/free) */
+    unsigned long *pfn_batch_slaver;
+    unsigned long *pfn_type_batch_slaver;
+    unsigned long *p2m_frame_list_temp;
+
+    int first_time;
+};
+
+/* we restore only one vm in a process, so it is safe to use a global
+ * variable */
+DECLARE_HYPERCALL_BUFFER(unsigned long, dirty_pages);
+
+int restore_colo_init(struct restore_data *comm_data, void **data)
+{
+    xc_dominfo_t info;
+    int i;
+    unsigned long size;
+    xc_interface *xch = comm_data->xch;
+    struct restore_colo_data *colo_data;
+    struct domain_info_context *dinfo = comm_data->dinfo;
+
+    if (comm_data->hvm)
+        /* hvm is unsupported now */
+        return -1;
+
+    if (dirty_pages)
+        /* restore_colo_init() is called more than once?? */
+        return -1;
+
+    colo_data = calloc(1, sizeof(struct restore_colo_data));
+    if (!colo_data)
+        return -1;
+
+    if (xc_domain_getinfo(xch, comm_data->dom, 1, &info) != 1)
+    {
+        PERROR("Could not get domain info");
+        goto err;
+    }
+
+    colo_data->max_mem_pfn = info.max_memkb >> (PAGE_SHIFT - 10);
+
+    colo_data->pfn_type_slaver = calloc(dinfo->p2m_size, sizeof(xen_pfn_t));
+    colo_data->pfn_batch_slaver = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t));
+    colo_data->pfn_type_batch_slaver = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t));
+    colo_data->p2m_frame_list_temp = malloc(P2M_FL_ENTRIES *
+                                    sizeof(*colo_data->p2m_frame_list_temp));
+
+    dirty_pages = xc_hypercall_buffer_alloc_pages(xch, dirty_pages,
+                                                  NRPAGES(BITMAP_SIZE));
+    colo_data->dirty_pages = dirty_pages;
+
+    size = dinfo->p2m_size * PAGE_SIZE;
+    colo_data->pagebase = malloc(size);
+    if (!colo_data->pfn_type_slaver || !colo_data->pfn_batch_slaver ||
+        !colo_data->pfn_type_batch_slaver || !colo_data->p2m_frame_list_temp ||
+        !colo_data->dirty_pages || !colo_data->pagebase) {
+        PERROR("Could not allocate memory for restore colo data");
+        goto err;
+    }
+
+    colo_data->xce = xc_evtchn_open(NULL, 0);
+    if (!colo_data->xce) {
+        PERROR("Could not open evtchn");
+        goto err;
+    }
+
+    for (i = 0; i < dinfo->p2m_size; i++)
+        comm_data->pfn_type[i] = XEN_DOMCTL_PFINFO_XTAB;
+    memset(dirty_pages, 0xff, BITMAP_SIZE);
+    colo_data->first_time = 1;
+    colo_data->local_port = -1;
+    *data = colo_data;
+
+    return 0;
+
+err:
+    restore_colo_free(comm_data, colo_data);
+    *data = NULL;
+    return -1;
+}
+
+void restore_colo_free(struct restore_data *comm_data, void *data)
+{
+    struct restore_colo_data *colo_data = data;
+
+    if (!colo_data)
+        return;
+
+    free(colo_data->pfn_type_slaver);
+    free(colo_data->pagebase);
+    free(colo_data->pfn_batch_slaver);
+    free(colo_data->pfn_type_batch_slaver);
+    free(colo_data->p2m_frame_list_temp);
+    if (dirty_pages)
+        xc_hypercall_buffer_free(comm_data->xch, dirty_pages);
+    if (colo_data->xce)
+        xc_evtchn_close(colo_data->xce);
+    free(colo_data);
+}
+
+char* get_page(struct restore_data *comm_data, void *data,
+               unsigned long pfn)
+{
+    struct restore_colo_data *colo_data = data;
+
+    set_bit(pfn, colo_data->dirty_pages);
+    return colo_data->pagebase + pfn * PAGE_SIZE;
+}
+
+/* Step 1: pin non-dirty L1 pagetables: ~dirty_pages & mL1 (= ~dirty_pages & sL1) */
+static int pin_l1(struct restore_data *comm_data,
+                  struct restore_colo_data *colo_data)
+{
+    unsigned int nr_pins = 0;
+    unsigned long i;
+    struct mmuext_op pin[MAX_PIN_BATCH];
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    unsigned long *pfn_type = comm_data->pfn_type;
+    uint32_t dom = comm_data->dom;
+    xc_interface *xch = comm_data->xch;
+    unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+
+    for (i = 0; i < dinfo->p2m_size; i++)
+    {
+        switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            if (pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LPINTAB)
+                /* don't pin what is already pinned */
+                continue;
+
+            if (test_bit(i, dirty_pages))
+                /* don't pin dirty */
+                continue;
+
+            /* here, it must also be L1 in the slaver, otherwise it is
+             * dirty. (add test code?)
+             */
+            pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+        case XEN_DOMCTL_PFINFO_L3TAB:
+        case XEN_DOMCTL_PFINFO_L4TAB:
+        default:
+            continue;
+        }
+
+        pin[nr_pins].arg1.mfn = comm_data->p2m[i];
+        nr_pins++;
+
+        /* Batch full? Then flush. */
+        if (nr_pins == MAX_PIN_BATCH)
+        {
+            if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)
+            {
+                PERROR("Failed to pin L1 batch of %d page tables", nr_pins);
+                return 1;
+            }
+            nr_pins = 0;
+        }
+    }
+
+    /* Flush final partial batch. */
+    if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0))
+    {
+        PERROR("Failed to pin L1 batch of %d page tables", nr_pins);
+        return 1;
+    }
+
+    return 0;
+}
+
+/* Step 2: unpin pagetables except non-dirty L1: sL2 + sL3 + sL4 + (dirty_pages & sL1) */
+static int unpin_pagetable(struct restore_data *comm_data,
+                           struct restore_colo_data *colo_data)
+{
+    unsigned int nr_pins = 0;
+    unsigned long i;
+    struct mmuext_op pin[MAX_PIN_BATCH];
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    uint32_t dom = comm_data->dom;
+    xc_interface *xch = comm_data->xch;
+    unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+
+    for (i = 0; i < dinfo->p2m_size; i++)
+    {
+        if ( (pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        switch ( pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            if (!test_bit(i, dirty_pages)) // it is in (~dirty_pages & mL1), keep it
+                continue;
+            // fallthrough
+        case XEN_DOMCTL_PFINFO_L2TAB:
+        case XEN_DOMCTL_PFINFO_L3TAB:
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            pin[nr_pins].cmd = MMUEXT_UNPIN_TABLE;
+            break;
+
+        default:
+            continue;
+        }
+
+        pin[nr_pins].arg1.mfn = comm_data->p2m[i];
+        nr_pins++;
+
+        /* Batch full? Then flush. */
+        if (nr_pins == MAX_PIN_BATCH)
+        {
+            if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)
+            {
+                PERROR("Failed to unpin batch of %d page tables", nr_pins);
+                return 1;
+            }
+            nr_pins = 0;
+        }
+    }
+
+    /* Flush final partial batch. */
+    if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0))
+    {
+        PERROR("Failed to unpin batch of %d page tables", nr_pins);
+        return 1;
+    }
+
+    return 0;
+}
+
+/* Step 3: we have unpinned all pagetables except non-dirty L1, so it is OK
+ * to map the dirty memory and update it.
+ */
+static int update_memory(struct restore_data *comm_data,
+                         struct restore_colo_data *colo_data)
+{
+    unsigned long pfn;
+    unsigned long max_mem_pfn = colo_data->max_mem_pfn;
+    unsigned long *pfn_type = comm_data->pfn_type;
+    unsigned long pagetype;
+    uint32_t dom = comm_data->dom;
+    xc_interface *xch = comm_data->xch;
+    int hvm = comm_data->hvm;
+    struct xc_mmu *mmu = comm_data->mmu;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+    char *pagebase = colo_data->pagebase;
+    int pfn_err = 0;
+    char *region_base_slaver;
+    xen_pfn_t region_mfn_slaver;
+    unsigned long mfn;
+    char *pagebuff;
+
+    for (pfn = 0; pfn < max_mem_pfn; pfn++) {
+        if ( !test_bit(pfn, dirty_pages) )
+            continue;
+
+        pagetype = pfn_type[pfn] & XEN_DOMCTL_PFINFO_LTAB_MASK;
+        if (pagetype == XEN_DOMCTL_PFINFO_XTAB)
+            /* a bogus/unmapped page: skip it */
+            continue;
+
+        mfn = comm_data->p2m[pfn];
+        region_mfn_slaver = mfn;
+        region_base_slaver = xc_map_foreign_bulk(xch, dom, PROT_WRITE,
+                                                 &region_mfn_slaver,
+                                                 &pfn_err, 1);
+        if (!region_base_slaver || pfn_err) {
+            PERROR("update_memory: xc_map_foreign_bulk failed");
+            return 1;
+        }
+
+        pagebuff = (char *)(pagebase + pfn * PAGE_SIZE);
+        memcpy(region_base_slaver, pagebuff, PAGE_SIZE);
+        munmap(region_base_slaver, PAGE_SIZE);
+
+        if (!hvm &&
+            xc_add_mmu_update(xch, mmu,
+                              (((unsigned long long)mfn) << PAGE_SHIFT)
+                              | MMU_MACHPHYS_UPDATE, pfn) )
+        {
+            PERROR("failed machpys update mfn=%lx pfn=%lx", mfn, pfn);
+            return 1;
+        }
+    }
+
+    /*
+     * Ensure we flush all machphys updates before potential PAE-specific
+     * reallocations below.
+     */
+    if (!hvm && xc_flush_mmu_updates(xch, mmu))
+    {
+        PERROR("Error doing flush_mmu_updates()");
+        return 1;
+    }
+
+    return 0;
+}
+
+/* Step 4: pin master pt
+ * Pin page tables. Do this after writing to them as otherwise Xen
+ * will barf when doing the type-checking.
+ */
+static int pin_pagetable(struct restore_data *comm_data,
+                         struct restore_colo_data *colo_data)
+{
+    unsigned int nr_pins = 0;
+    unsigned long i;
+    struct mmuext_op pin[MAX_PIN_BATCH];
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    unsigned long *pfn_type = comm_data->pfn_type;
+    uint32_t dom = comm_data->dom;
+    xc_interface *xch = comm_data->xch;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+
+    for ( i = 0; i < dinfo->p2m_size; i++ )
+    {
+        if ( (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 )
+            continue;
+
+        switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            if (!test_bit(i, dirty_pages))
+                /* it is in (~dirty_pages & mL1)(= ~dirty_pages & sL1),
+                 * already pinned
+                 */
+                continue;
+
+            pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L3TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L4TAB:
+            pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE;
+            break;
+
+        default:
+            continue;
+        }
+
+        pin[nr_pins].arg1.mfn = comm_data->p2m[i];
+        nr_pins++;
+
+        /* Batch full? Then flush. */
+        if (nr_pins == MAX_PIN_BATCH)
+        {
+            if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)
+            {
+                PERROR("Failed to pin batch of %d page tables", nr_pins);
+                return 1;
+            }
+            nr_pins = 0;
+        }
+    }
+
+    /* Flush final partial batch. */
+    if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0))
+    {
+        PERROR("Failed to pin batch of %d page tables", nr_pins);
+        return 1;
+    }
+
+    return 0;
+}
+
+/* Step 5: unpin unneeded non-dirty L1 pagetables: ~dirty_pages & mL1 (= ~dirty_pages & sL1) */
+static int unpin_l1(struct restore_data *comm_data,
+                    struct restore_colo_data *colo_data)
+{
+    unsigned int nr_pins = 0;
+    unsigned long i;
+    struct mmuext_op pin[MAX_PIN_BATCH];
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    unsigned long *pfn_type = comm_data->pfn_type;
+    uint32_t dom = comm_data->dom;
+    xc_interface *xch = comm_data->xch;
+    unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+
+    for (i = 0; i < dinfo->p2m_size; i++)
+    {
+        switch ( pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK )
+        {
+        case XEN_DOMCTL_PFINFO_L1TAB:
+            if (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) // still needed
+                continue;
+            if (test_bit(i, dirty_pages)) // not pinned by step 1
+                continue;
+
+            pin[nr_pins].cmd = MMUEXT_UNPIN_TABLE;
+            break;
+
+        case XEN_DOMCTL_PFINFO_L2TAB:
+        case XEN_DOMCTL_PFINFO_L3TAB:
+        case XEN_DOMCTL_PFINFO_L4TAB:
+        default:
+            continue;
+        }
+
+        pin[nr_pins].arg1.mfn = comm_data->p2m[i];
+        nr_pins++;
+
+        /* Batch full? Then flush. */
+        if (nr_pins == MAX_PIN_BATCH)
+        {
+            if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)
+            {
+                PERROR("Failed to unpin L1 batch of %d page tables", nr_pins);
+                return 1;
+            }
+            nr_pins = 0;
+        }
+    }
+
+    /* Flush final partial batch. */
+    if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0))
+    {
+        PERROR("Failed to unpin L1 batch of %d page tables", nr_pins);
+        return 1;
+    }
+
+    return 0;
+}
+
+int flush_memory(struct restore_data *comm_data, void *data)
+{
+    struct restore_colo_data *colo_data = data;
+
+    if (pin_l1(comm_data, colo_data) != 0)
+        return -1;
+    if (unpin_pagetable(comm_data, colo_data) != 0)
+        return -1;
+    if (update_memory(comm_data, colo_data) != 0)
+        return -1;
+    if (pin_pagetable(comm_data, colo_data) != 0)
+        return -1;
+    if (unpin_l1(comm_data, colo_data) != 0)
+        return -1;
+
+    memcpy(colo_data->pfn_type_slaver, comm_data->pfn_type,
+           comm_data->dinfo->p2m_size * sizeof(xen_pfn_t));
+
+    return 0;
+}
+
+int update_p2m_table(struct restore_data *comm_data, void *data)
+{
+    struct restore_colo_data *colo_data = data;
+    unsigned long i, j, n, pfn;
+    int k;
+    unsigned long *p2m_frame_list = comm_data->p2m_frame_list;
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    unsigned long *pfn_type = comm_data->pfn_type;
+    xc_interface *xch = comm_data->xch;
+    uint32_t dom = comm_data->dom;
+    unsigned long *dirty_pages = colo_data->dirty_pages;
+    unsigned long *p2m_frame_list_temp = colo_data->p2m_frame_list_temp;
+
+    /* A temporary mapping of the guest's p2m table (all dirty pages) */
+    xen_pfn_t *live_p2m;
+    /* A temporary mapping of the guest's p2m table (1 page) */
+    xen_pfn_t *live_p2m_one;
+    unsigned long *p2m;
+
+    j = 0;
+    for (i = 0; i < P2M_FL_ENTRIES; i++)
+    {
+        pfn = p2m_frame_list[i];
+        if ((pfn >= dinfo->p2m_size) || (pfn_type[pfn] != XEN_DOMCTL_PFINFO_NOTAB))
+        {
+            ERROR("PFN-to-MFN frame number %i (%#lx) is bad", i, pfn);
+            return -1;
+        }
+
+        if (!test_bit(pfn, dirty_pages))
+            continue;
+
+        p2m_frame_list_temp[j++] = comm_data->p2m[pfn];
+    }
+
+    if (j)
+    {
+        /* Copy the P2M we've constructed to the 'live' P2M */
+        if (!(live_p2m = xc_map_foreign_pages(xch, dom, PROT_WRITE,
+                                              p2m_frame_list_temp, j)))
+        {
+            PERROR("Couldn't map p2m table");
+            return -1;
+        }
+
+        j = 0;
+        for (i = 0; i < P2M_FL_ENTRIES; i++)
+        {
+            pfn = p2m_frame_list[i];
+            if (!test_bit(pfn, dirty_pages))
+                continue;
+
+            live_p2m_one = (xen_pfn_t *)((char *)live_p2m + PAGE_SIZE * j++);
+            /* If the domain we're restoring has a different word size to
+             * ours, we need to adjust the live_p2m assignment appropriately */
+            if (dinfo->guest_width > sizeof (xen_pfn_t))
+            {
+                n = (i + 1) * FPP - 1;
+                for (k = FPP - 1; k >= 0; k--)
+                    ((uint64_t *)live_p2m_one)[k] = (long)comm_data->p2m[n--];
+            }
+            else if (dinfo->guest_width < sizeof (xen_pfn_t))
+            {
+                n = i * FPP;
+                for (k = 0; k < FPP; k++)
+                    ((uint32_t *)live_p2m_one)[k] = comm_data->p2m[n++];
+            }
+            else
+            {
+                p2m = (xen_pfn_t *)((char *)comm_data->p2m + PAGE_SIZE * i);
+                memcpy(live_p2m_one, p2m, PAGE_SIZE);
+            }
+        }
+        munmap(live_p2m, j * PAGE_SIZE);
+    }
+
+    return 0;
+}
+
+static int update_pfn_type(xc_interface *xch, uint32_t dom, int count,
+                           xen_pfn_t *pfn_batch, xen_pfn_t *pfn_type_batch,
+                           xen_pfn_t *pfn_type)
+{
+    unsigned long k;
+
+    if (xc_get_pfn_type_batch(xch, dom, count, pfn_type_batch))
+    {
+        ERROR("xc_get_pfn_type_batch for slaver failed");
+        return -1;
+    }
+
+    for (k = 0; k < count; k++)
+        pfn_type[pfn_batch[k]] = pfn_type_batch[k] & XEN_DOMCTL_PFINFO_LTAB_MASK;
+
+    return 0;
+}
+
+/* We are ready to start the guest when this function is called. We will
+ * not return until we need to do a new checkpoint or some error occurs.
+ *
+ * communication with python
+ * python code              restore code         comment
+ *              <====       "finish\n"
+ * "resume\n"   ====>                            guest is resumed
+ *              <====       "resume\n"           postresume is done
+ * "suspend\n"  ====>                            a new checkpoint begins
+ *              <====       "suspend\n"          guest is suspended
+ * "start\n"    ====>                            getting dirty pages begins
+ *
+ * return value:
+ *   -1: error
+ *    0: continue to start vm
+ *    1: continue to do a checkpoint
+ */
+int finish_colo(struct restore_data *comm_data, void *data)
+{
+    struct restore_colo_data *colo_data = data;
+    xc_interface *xch = comm_data->xch;
+    uint32_t dom = comm_data->dom;
+    struct domain_info_context *dinfo = comm_data->dinfo;
+    xc_evtchn *xce = colo_data->xce;
+    unsigned long *pfn_batch_slaver = colo_data->pfn_batch_slaver;
+    unsigned long *pfn_type_batch_slaver = colo_data->pfn_type_batch_slaver;
+    unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver;
+    DECLARE_HYPERCALL;
+
+    unsigned long i, j;
+    int rc;
+    char str[10];
+    int remote_port;
+    int local_port = colo_data->local_port;
+
+#if 0
+    /* output the store-mfn & console-mfn */
+    printf("store-mfn %li\n", *store_mfn);
+    printf("console-mfn %li\n", *console_mfn);
+#endif
+
+    /* we need to know which pages are dirty to restore the guest */
+    if (xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY, NULL,
+                          0, NULL, 0, NULL) < 0 )
+    {
+        rc = xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_OFF, NULL, 0,
+                               NULL, 0, NULL);
+        if (rc >= 0)
+        {
+            rc = xc_shadow_control(xch, dom,
+                                   XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY, NULL,
+                                   0, NULL, 0, NULL);
+        }
+        if (rc < 0)
+        {
+            ERROR("enabling logdirty fails");
+            return -1;
+        }
+    }
+
+    /* notify python code checkpoint finish */
+    printf("finish\n");
+    fflush(stdout);
+
+    /* wait for domain resume, then connect the suspend evtchn */
+    scanf("%9s", str);
+
+    if (colo_data->first_time) {
+        sleep(10);
+        remote_port = xs_suspend_evtchn_port(dom);
+        if (remote_port < 0) {
+            ERROR("getting remote suspend port fails");
+            return -1;
+        }
+
+        local_port = xc_suspend_evtchn_init(xch, xce, dom, remote_port);
+        if (local_port < 0) {
+            ERROR("initializing suspend evtchn fails");
+            return -1;
+        }
+
+        colo_data->local_port = local_port;
+    }
+
+    /* notify python code vm is resumed */
+    printf("resume\n");
+    fflush(stdout);
+
+    /* wait for the next checkpoint */
+    scanf("%9s", str);
+    if (strcmp(str, "suspend"))
+    {
+        ERROR("waiting for a new checkpoint fails");
+        /* start the guest now? */
+        return 0;
+    }
+
+    /* notify the suspend evtchn */
+    rc = xc_evtchn_notify(xce, local_port);
+    if (rc < 0)
+    {
+        ERROR("notifying the suspend evtchn fails");
+        return -1;
+    }
+
+    rc = xc_await_suspend(xch, xce, local_port);
+    if (rc < 0)
+    {
+        ERROR("waiting for suspend fails");
+        return -1;
+    }
+
+    /* notify python code suspend is done */
+    printf("suspend\n");
+    fflush(stdout);
+
+    scanf("%9s", str);
+    if (strcmp(str, "start"))
+        return -1;
+
+    memset(colo_data->dirty_pages, 0x0, BITMAP_SIZE);
+    if (xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_CLEAN,
+                          HYPERCALL_BUFFER(dirty_pages), dinfo->p2m_size,
+                          NULL, 0, NULL) != dinfo->p2m_size)
+    {
+        ERROR("getting slaver dirty pages fails");
+        return -1;
+    }
+
+    if (xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_OFF, NULL, 0, NULL,
+                          0, NULL) < 0 )
+    {
+        ERROR("disabling dirty-log fails");
+        return -1;
+    }
+
+    j = 0;
+    for (i = 0; i < colo_data->max_mem_pfn; i++)
+    {
+        if ( !test_bit(i, colo_data->dirty_pages) )
+            continue;
+
+        pfn_batch_slaver[j] = i;
+        pfn_type_batch_slaver[j++] = comm_data->p2m[i];
+        if (j == MAX_BATCH_SIZE)
+        {
+            if (update_pfn_type(xch, dom, j, pfn_batch_slaver,
+                                pfn_type_batch_slaver, pfn_type_slaver))
+            {
+                return -1;
+            }
+            j = 0;
+        }
+    }
+
+    if (j)
+    {
+        if (update_pfn_type(xch, dom, j, pfn_batch_slaver,
+                            pfn_type_batch_slaver, pfn_type_slaver))
+        {
+            return -1;
+        }
+    }
+
+    /* reset memory */
+    hypercall.op = __HYPERVISOR_reset_memory_op;
+    hypercall.arg[0] = (unsigned long)dom;
+    do_xen_hypercall(xch, &hypercall);
+
+    colo_data->first_time = 0;
+
+    return 1;
+}
diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index 222aa03..3aafa61 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -28,8 +28,7 @@
 
 #include "xc_private.h"
 #include "xc_dom.h"
-#include "xg_private.h"
-#include "xg_save_restore.h"
+#include "xc_save_restore_colo.h"
 #include <xen/hvm/params.h>
 #include "xc_e820.h"
 
@@ -82,37 +81,6 @@ struct outbuf {
     ((mfn_to_pfn(_mfn) < (dinfo->p2m_size)) && \
      (pfn_to_mfn(mfn_to_pfn(_mfn)) == (_mfn))))
 
-/*
-** During (live) save/migrate, we maintain a number of bitmaps to track
-** which pages we have to send, to fixup, and to skip.
-*/
-
-#define BITS_PER_LONG (sizeof(unsigned long) * 8)
-#define BITS_TO_LONGS(bits) (((bits)+BITS_PER_LONG-1)/BITS_PER_LONG)
-#define BITMAP_SIZE (BITS_TO_LONGS(dinfo->p2m_size) * sizeof(unsigned long))
-
-#define BITMAP_ENTRY(_nr,_bmap) \
-   ((volatile unsigned long *)(_bmap))[(_nr)/BITS_PER_LONG]
-
-#define BITMAP_SHIFT(_nr) ((_nr) % BITS_PER_LONG)
-
-#define ORDER_LONG (sizeof(unsigned long) == 4 ? 5 : 6)
-
-static inline int test_bit (int nr, volatile void * addr)
-{
-    return (BITMAP_ENTRY(nr, addr) >> BITMAP_SHIFT(nr)) & 1;
-}
-
-static inline void clear_bit (int nr, volatile void * addr)
-{
-    BITMAP_ENTRY(nr, addr) &= ~(1UL << BITMAP_SHIFT(nr));
-}
-
-static inline void set_bit ( int nr, volatile void * addr)
-{
-    BITMAP_ENTRY(nr, addr) |= (1UL << BITMAP_SHIFT(nr));
-}
-
 /* Returns the hamming weight (i.e. the number of bits set) in a N-bit word */
 static inline unsigned int hweight32(unsigned int w)
 {
diff --git a/tools/libxc/xc_save_restore_colo.h b/tools/libxc/xc_save_restore_colo.h
new file mode 100644
index 0000000..1283c9c
--- /dev/null
+++ b/tools/libxc/xc_save_restore_colo.h
@@ -0,0 +1,44 @@
+#ifndef XC_SAVE_RESTORE_COLO_H
+#define XC_SAVE_RESTORE_COLO_H
+
+#include <xg_save_restore.h>
+#include <xg_private.h>
+
+extern int restore_colo_init(struct restore_data *, void **);
+extern void restore_colo_free(struct restore_data *, void *);
+extern char* get_page(struct restore_data *, void *, unsigned long);
+extern int flush_memory(struct restore_data *, void *);
+extern int update_p2m_table(struct restore_data *, void *);
+extern int finish_colo(struct restore_data *, void *);
+
+/*
+** During (live) save/migrate, we maintain a number of bitmaps to track
+** which pages we have to send, to fixup, and to skip.
+*/
+
+#define BITS_PER_LONG (sizeof(unsigned long) * 8)
+#define BITS_TO_LONGS(bits) (((bits)+BITS_PER_LONG-1)/BITS_PER_LONG)
+#define BITMAP_SIZE (BITS_TO_LONGS(dinfo->p2m_size) * sizeof(unsigned long))
+
+#define BITMAP_ENTRY(_nr,_bmap) \
+   ((volatile unsigned long *)(_bmap))[(_nr)/BITS_PER_LONG]
+
+#define BITMAP_SHIFT(_nr) ((_nr) % BITS_PER_LONG)
+
+#define ORDER_LONG (sizeof(unsigned long) == 4 ? 5 : 6)
+
+static inline int test_bit (int nr, volatile void * addr)
+{
+    return (BITMAP_ENTRY(nr, addr) >> BITMAP_SHIFT(nr)) & 1;
+}
+
+static inline void clear_bit (int nr, volatile void * addr)
+{
+    BITMAP_ENTRY(nr, addr) &= ~(1UL << BITMAP_SHIFT(nr));
+}
+
+static inline void set_bit ( int nr, volatile void * addr)
+{
+    BITMAP_ENTRY(nr, addr) |= (1UL << BITMAP_SHIFT(nr));
+}
+#endif
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index 93c3fe3..d7ee050 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -93,6 +93,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
 #define __HYPERVISOR_domctl               36
 #define __HYPERVISOR_kexec_op             37
 #define __HYPERVISOR_tmem_op              38
+#define __HYPERVISOR_reset_memory_op      40
 
 /* Architecture-specific hypercall definitions. */
 #define __HYPERVISOR_arch_0               48
-- 
1.8.0
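[The stdout/stdin handshake that finish_colo() implements is easiest to
see as two tiny helpers. A minimal sketch, not part of the patch, assuming
the control messages arrive as whitespace-delimited tokens, which is what
the scanf("%s", ...) calls above imply:]

#include <stdio.h>
#include <string.h>

static void say(const char *msg)      /* xc_restore -> XendCheckpoint.py */
{
    printf("%s\n", msg);
    fflush(stdout);
}

static int expect(const char *msg)    /* XendCheckpoint.py -> xc_restore */
{
    char buf[16];

    if (scanf("%15s", buf) != 1)
        return -1;
    return strcmp(buf, msg) ? -1 : 0;
}

/* One checkpoint round, steps (a)-(f) of the commit message:
 *   say("finish"); expect("resume"); say("resume");
 *   expect("suspend"); say("suspend"); expect("start");
 */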
Wen Congyang
2013-Apr-03 08:02 UTC
[RFC PATCH 4/7] xc_domain_save: flush cache before calling callbacks->postcopy()
callbacks->postcopy() may use the fd to transfer something to the other
end, so we should flush the cache before calling callbacks->postcopy().

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_domain_save.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index 3aafa61..cc4004a 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1886,9 +1886,6 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
  out:
     completed = 1;
 
-    if ( !rc && callbacks->postcopy )
-        callbacks->postcopy(callbacks->data);
-
     /* Flush last write and discard cache for file. */
     if ( outbuf_flush(xch, &ob, io_fd) < 0 ) {
         PERROR("Error when flushing output buffer");
@@ -1897,6 +1894,9 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
 
     discard_file_cache(xch, io_fd, 1 /* flush */);
 
+    if ( !rc && callbacks->postcopy )
+        callbacks->postcopy(callbacks->data);
+
     /* checkpoint_cb can spend arbitrarily long in between rounds */
     if (!rc && callbacks->checkpoint &&
         callbacks->checkpoint(callbacks->data) > 0)
-- 
1.8.0
Wen Congyang
2013-Apr-03 08:02 UTC
[RFC PATCH 5/7] xc_domain_save: implement save_callbacks for colo
Add a new save callbacks: 1. post_sendstate(): SVM will run only when XC_SAVE_ID_LAST_CHECKPOINT is sent to slaver. But we only sent XC_SAVE_ID_LAST_CHECKPOINT when we do live migration now. Add this callback, and we can send it in this callback. Update some callbacks for colo: 1. suspend(): In colo mode, both PVM and SVM are running. So we should suspend both PVM and SVM. Communicate with slaver like this: a. write "continue" to notify slaver to suspend SVM b. suspend PVM and SVM c. slaver writes "suspend" to tell master that SVM is suspended 2. postcopy(): In colo mode, both PVM and SVM are running, and we have suspended both PVM and SVM. So we should resume PVM and SVM Communicate with slaver like this: a. write "resume" to notify slaver to resume SVM b. resume PVM and SVM c. slaver writes "resume" to tell master that SVM is resumed 3. checkpoint(): In colo mode, we do a new checkpoint only when output packet from PVM and SVM is different. We will block in this callback and return when a output packet is different. Signed-off-by: Ye Wei <wei.ye1987@gmail.com> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> --- tools/libxc/xc_domain_save.c | 9 + tools/libxc/xenguest.h | 3 + tools/python/xen/lowlevel/checkpoint/checkpoint.c | 289 +++++++++++++++++++++- tools/python/xen/lowlevel/checkpoint/checkpoint.h | 2 + 4 files changed, 299 insertions(+), 4 deletions(-) diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c index cc4004a..870fea5 100644 --- a/tools/libxc/xc_domain_save.c +++ b/tools/libxc/xc_domain_save.c @@ -1645,6 +1645,15 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter } } + if ( callbacks->post_sendstate ) + { + if ( callbacks->post_sendstate(callbacks->data) < 0) + { + PERROR("Error: post_sendstate()\n"); + goto out; + } + } + /* Zero terminate */ i = 0; if ( wrexact(io_fd, &i, sizeof(int)) ) diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h index 709a284..04d2aaf 100644 --- a/tools/libxc/xenguest.h +++ b/tools/libxc/xenguest.h @@ -43,6 +43,9 @@ struct save_callbacks { /* Enable qemu-dm logging dirty pages to xen */ int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */ + /* called before Zero terminate is sent */ + int (*post_sendstate)(void *data); + /* to be provided as the last argument to each callback function */ void* data; }; diff --git a/tools/python/xen/lowlevel/checkpoint/checkpoint.c b/tools/python/xen/lowlevel/checkpoint/checkpoint.c index 7545d7d..f880f1b 100644 --- a/tools/python/xen/lowlevel/checkpoint/checkpoint.c +++ b/tools/python/xen/lowlevel/checkpoint/checkpoint.c @@ -1,14 +1,22 @@ /* python bridge to checkpointing API */ #include <Python.h> +#include <sys/wait.h> #include <xs.h> #include <xenctrl.h> +#include <xc_private.h> +#include <xg_save_restore.h> #include "checkpoint.h" #define PKG "xen.lowlevel.checkpoint" +#define COMP_IOC_MAGIC ''k'' +#define COMP_IOCTWAIT _IO(COMP_IOC_MAGIC, 0) +#define COMP_IOCTFLUSH _IO(COMP_IOC_MAGIC, 1) +#define COMP_IOCTRESUME _IO(COMP_IOC_MAGIC, 2) + static PyObject* CheckpointError; typedef struct { @@ -24,11 +32,15 @@ typedef struct { PyObject* checkpoint_cb; PyThreadState* threadstate; + int colo; + int first_time; + int dev_fd; } CheckpointObject; static int suspend_trampoline(void* data); static int postcopy_trampoline(void* data); static int checkpoint_trampoline(void* data); +static int post_sendstate_trampoline(void *data); static PyObject* 
Checkpoint_new(PyTypeObject* type, PyObject* args, PyObject* kwargs) @@ -105,6 +117,7 @@ static PyObject* pycheckpoint_start(PyObject* obj, PyObject* args) { int fd; struct save_callbacks callbacks; int rc; + int flags = 0; if (!PyArg_ParseTuple(args, "O|OOOI", &iofile, &suspend_cb, &postcopy_cb, &checkpoint_cb, &interval)) @@ -151,9 +164,16 @@ static PyObject* pycheckpoint_start(PyObject* obj, PyObject* args) { } else self->checkpoint_cb = NULL; + if (flags & CHECKPOINT_FLAGS_COLO) + self->colo = 1; + else + self->colo = 0; + self->first_time = 1; + callbacks.suspend = suspend_trampoline; callbacks.postcopy = postcopy_trampoline; callbacks.checkpoint = checkpoint_trampoline; + callbacks.post_sendstate = post_sendstate_trampoline; callbacks.data = self; self->threadstate = PyEval_SaveThread(); @@ -258,6 +278,192 @@ PyMODINIT_FUNC initcheckpoint(void) { block_timer(); } +/* colo functions */ + +/* master slaver comment + * "continue" ===> + * <=== "suspend" guest is suspended + */ +static int notify_slaver_suspend(CheckpointObject *self) +{ + int fd = self->cps.fd; + + return write_exact(fd, "continue", 8); +} + +static int wait_slaver_suspend(CheckpointObject *self) +{ + int fd = self->cps.fd; + xc_interface *xch = self->cps.xch; + char buf[8]; + + if (self->first_time) { + self->first_time = 0; + return 0; + } + + if ( read_exact(fd, buf, 7) < 0) { + PERROR("read: suspend"); + return -1; + } + + buf[7] = ''\0''; + if (strcmp(buf, "suspend")) { + PERROR("read \"%s\", expect \"suspend\"", buf); + return -1; + } + + return 0; +} + +static int notify_slaver_start_checkpoint(CheckpointObject *self) +{ + int fd = self->cps.fd; + xc_interface *xch = self->cps.xch; + + if ( write_exact(fd, "start", 8) < 0) { + PERROR("write start"); + return -1; + } + + return 0; +} + +/* + * master slaver + * <==== "finish" + * flush packets + * "resume" ====> + * resume vm resume vm + * <==== "resume" + */ +static int notify_slaver_resume(CheckpointObject *self) +{ + int fd = self->cps.fd; + xc_interface *xch = self->cps.xch; + char buf[7]; + + /* wait slaver to finish update memory, device state... 
+   */
+  if (read_exact(fd, buf, 6) < 0) {
+    PERROR("read: finish");
+    return -1;
+  }
+
+  buf[6] = '\0';
+  if (strcmp(buf, "finish")) {
+    ERROR("read \"%s\", expect \"finish\"", buf);
+    return -1;
+  }
+
+  if (!self->first_time)
+    /* flush queued packets now */
+    ioctl(self->dev_fd, COMP_IOCTFLUSH);
+
+  /* notify slaver to resume vm */
+  if (write_exact(fd, "resume", 6)) {
+    PERROR("write: resume");
+    return -1;
+  }
+
+  return 0;
+}
+
+static int install_fw_network(CheckpointObject *self)
+{
+  pid_t pid;
+  xc_interface *xch = self->cps.xch;
+  int status;
+  int rc;
+
+  pid = vfork();
+  if (pid < 0) {
+    PERROR("vfork fails");
+    return -1;
+  }
+
+  if (pid > 0) {
+    rc = wait(&status);
+    if (rc < 0 || status != 0) {
+      ERROR("getting child status fails");
+      return -1;
+    }
+
+    return 0;
+  }
+
+  execl("/etc/xen/scripts/HA_fw_runtime.sh", "HA_fw_runtime.sh", "install", NULL);
+  PERROR("execl fails");
+  _exit(127);
+}
+
+static int wait_slaver_resume(CheckpointObject *self)
+{
+  int fd = self->cps.fd;
+  xc_interface *xch = self->cps.xch;
+  char buf[7];
+
+  if (read_exact(fd, buf, 6) < 0) {
+    PERROR("read resume");
+    return -1;
+  }
+
+  buf[6] = '\0';
+  if (strcmp(buf, "resume")) {
+    ERROR("read \"%s\", expect \"resume\"", buf);
+    return -1;
+  }
+
+  return 0;
+}
+
+static int colo_postresume(CheckpointObject *self)
+{
+  int rc;
+  int dev_fd = self->dev_fd;
+
+  rc = wait_slaver_resume(self);
+  if (rc < 0)
+    return rc;
+
+  if (self->first_time) {
+    rc = install_fw_network(self);
+    if (rc < 0)
+      return rc;
+  } else {
+    ioctl(dev_fd, COMP_IOCTRESUME);
+  }
+
+  return 0;
+}
+
+static int pre_checkpoint(CheckpointObject *self)
+{
+  xc_interface *xch = self->cps.xch;
+
+  if (!self->first_time)
+    return 0;
+
+  self->dev_fd = open("/dev/HA_compare", O_RDWR);
+  if (self->dev_fd < 0) {
+    PERROR("opening /dev/HA_compare fails");
+    return -1;
+  }
+
+  return 0;
+}
+
+static void wait_new_checkpoint(CheckpointObject *self)
+{
+  int dev_fd = self->dev_fd;
+  int err;
+
+  while (1) {
+    err = ioctl(dev_fd, COMP_IOCTWAIT);
+    if (err == 0 || err == -1)
+      break;
+  }
+}
+
 /* private functions */
 
 /* bounce C suspend call into python equivalent.
@@ -268,6 +474,13 @@ static int suspend_trampoline(void* data)
 
   PyObject* result;
 
+  if (self->colo) {
+    if (notify_slaver_suspend(self) < 0) {
+      fprintf(stderr, "notifying slaver suspend fails\n");
+      return 0;
+    }
+  }
+
   /* call default suspend function, then python hook if available */
   if (self->armed) {
     if (checkpoint_wait(&self->cps) < 0) {
@@ -286,8 +499,16 @@ static int suspend_trampoline(void* data)
     }
   }
 
+  /* suspend_cb() should be called after both sides are suspended */
+  if (self->colo) {
+    if (wait_slaver_suspend(self) < 0) {
+      fprintf(stderr, "waiting slaver suspend fails\n");
+      return 0;
+    }
+  }
+
   if (!self->suspend_cb)
-    return 1;
+    goto start_checkpoint;
 
   PyEval_RestoreThread(self->threadstate);
   result = PyObject_CallFunction(self->suspend_cb, NULL);
@@ -298,12 +519,24 @@ static int suspend_trampoline(void* data)
 
   if (result == Py_None || PyObject_IsTrue(result)) {
     Py_DECREF(result);
-    return 1;
+    goto start_checkpoint;
   }
 
   Py_DECREF(result);
 
   return 0;
+
+start_checkpoint:
+  if (self->colo) {
+    if (notify_slaver_start_checkpoint(self) < 0) {
+      fprintf(stderr, "notifying slaver to start checkpoint fails\n");
+      return 0;
+    }
+
+    self->first_time = 0;
+  }
+
+  return 1;
 }
 
 static int postcopy_trampoline(void* data)
@@ -313,6 +546,13 @@
 
   PyObject* result;
   int rc = 0;
 
+  if (self->colo) {
+    if (notify_slaver_resume(self) < 0) {
+      fprintf(stderr, "notifying slaver resume fails\n");
+      return 0;
+    }
+  }
+
   if (!self->postcopy_cb)
     goto resume;
 
@@ -331,6 +571,13 @@
     return 0;
   }
 
+  if (self->colo) {
+    if (colo_postresume(self) < 0) {
+      fprintf(stderr, "postresume fails\n");
+      return 0;
+    }
+  }
+
   return rc;
 }
 
@@ -345,8 +592,15 @@
     return -1;
   }
 
+  if (self->colo) {
+    if (pre_checkpoint(self) < 0) {
+      fprintf(stderr, "pre_checkpoint() fails\n");
+      return -1;
+    }
+  }
+
   if (!self->checkpoint_cb)
-    return 0;
+    goto wait_checkpoint;
 
   PyEval_RestoreThread(self->threadstate);
   result = PyObject_CallFunction(self->checkpoint_cb, NULL);
@@ -357,10 +611,37 @@
 
   if (result == Py_None || PyObject_IsTrue(result)) {
     Py_DECREF(result);
-    return 1;
+    goto wait_checkpoint;
   }
 
   Py_DECREF(result);
 
   return 0;
+
+wait_checkpoint:
+  if (self->colo) {
+    wait_new_checkpoint(self);
+  }
+
+  return 1;
+}
+
+static int post_sendstate_trampoline(void* data)
+{
+  CheckpointObject *self = data;
+  int fd = self->cps.fd;
+  int i = XC_SAVE_ID_LAST_CHECKPOINT;
+
+  if (!self->colo)
+    return 0;
+
+  /* In colo mode, the guest is running on the slaver side, so we should
+   * send XC_SAVE_ID_LAST_CHECKPOINT to the slaver.
+   */
+  if (write_exact(fd, &i, sizeof(int)) < 0) {
+    fprintf(stderr, "writing XC_SAVE_ID_LAST_CHECKPOINT fails\n");
+    return -1;
+  }
+
+  return 0;
 }
diff --git a/tools/python/xen/lowlevel/checkpoint/checkpoint.h b/tools/python/xen/lowlevel/checkpoint/checkpoint.h
index 36455fb..5dd6440 100644
--- a/tools/python/xen/lowlevel/checkpoint/checkpoint.h
+++ b/tools/python/xen/lowlevel/checkpoint/checkpoint.h
@@ -40,6 +40,8 @@ typedef struct {
   timer_t timer;
 } checkpoint_state;
 
+#define CHECKPOINT_FLAGS_COLO 2
+
 char* checkpoint_error(checkpoint_state* s);
 
 void checkpoint_init(checkpoint_state* s);
-- 
1.8.0
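Taken together, the master performs one fixed sequence per checkpoint
round. The sketch below restates the handshake from this patch's commit
message in compact Python, purely for illustration; the shipped
implementation is the C code above, and conn is an assumed file-like
wrapper around the master/slaver connection.

    # Illustrative Python restatement of the master-side COLO handshake
    # implemented in C in checkpoint.c (patch 5). `conn` is an assumed
    # file-like object; the token sizes match the read_exact()/
    # write_exact() calls in the patch.

    def expect(conn, token):
        """Read len(token) bytes and verify they match the expected token."""
        buf = conn.read(len(token))
        if buf != token:
            raise IOError('read %r, expect %r' % (buf, token))

    def master_round(conn, first_time, send_checkpoint):
        conn.write('continue')       # notify_slaver_suspend(): suspend the SVM
        # ... the PVM is suspended here ...
        if not first_time:
            expect(conn, 'suspend')  # wait_slaver_suspend(): SVM now suspended
        conn.write('start')          # notify_slaver_start_checkpoint()
        send_checkpoint()            # dirty pages + XC_SAVE_ID_LAST_CHECKPOINT
        expect(conn, 'finish')       # slaver has applied the checkpoint
        # queued output packets are flushed here (COMP_IOCTFLUSH)
        conn.write('resume')         # notify_slaver_resume(); both VMs resume
        expect(conn, 'resume')       # wait_slaver_resume(): SVM running again
        # wait_new_checkpoint(): block on COMP_IOCTWAIT until outputs diverge

The slaver drives the mirror image of this sequence from
XendCheckpoint.py, which the next patch implements.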
In colo mode, XendCheckpoint.py communicates with both the master and
xc_restore; this patch implements that communication. In colo mode, the
signature is "GuestColoRestore".

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/python/xen/xend/XendCheckpoint.py | 138 +++++++++++++++++++++++---------
 1 file changed, 101 insertions(+), 37 deletions(-)

diff --git a/tools/python/xen/xend/XendCheckpoint.py b/tools/python/xen/xend/XendCheckpoint.py
index fa09757..261d9d1 100644
--- a/tools/python/xen/xend/XendCheckpoint.py
+++ b/tools/python/xen/xend/XendCheckpoint.py
@@ -25,6 +25,7 @@ from xen.xend.XendConstants import *
 from xen.xend import XendNode
 
 SIGNATURE = "LinuxGuestRecord"
+COLO_SIGNATURE = "GuestColoRestore"
 QEMU_SIGNATURE = "QemuDeviceModelRecord"
 dm_batch = 512
 XC_SAVE = "xc_save"
@@ -203,10 +204,15 @@ def restore(xd, fd, dominfo = None, paused = False, relocating = False):
     signature = read_exact(fd, len(SIGNATURE),
         "not a valid guest state file: signature read")
-    if signature != SIGNATURE:
+    if signature != SIGNATURE and signature != COLO_SIGNATURE:
         raise XendError("not a valid guest state file: found '%s'" %
                         signature)
 
+    if signature == COLO_SIGNATURE:
+        colo = True
+    else:
+        colo = False
+
     l = read_exact(fd, sizeof_int,
                    "not a valid guest state file: config size read")
     vmconfig_size = unpack("!i", l)[0]
@@ -305,6 +311,7 @@ def restore(xd, fd, dominfo = None, paused = False, relocating = False):
         log.debug("[xc_restore]: %s", string.join(cmd))
 
-        handler = RestoreInputHandler()
+        handler = RestoreInputHandler(colo)
+        restore_handler = RestoreHandler(fd, colo, dominfo, handler)
 
-        forkHelper(cmd, fd, handler.handler, True)
+        forkHelper(cmd, fd, handler.handler, True, restore_handler)
 
@@ -321,35 +328,9 @@ def restore(xd, fd, dominfo = None, paused = False, relocating = False):
             raise XendError('Could not read store MFN')
 
         if not is_hvm and handler.console_mfn is None:
-            raise XendError('Could not read console MFN')
-
-        restore_image.setCpuid()
-
-        # xc_restore will wait for source to close connection
-
-        dominfo.completeRestore(handler.store_mfn, handler.console_mfn)
-
-        #
-        # We shouldn't hold the domains_lock over a waitForDevices
-        # As this function sometime gets called holding this lock,
-        # we must release it and re-acquire it appropriately
-        #
-        from xen.xend import XendDomain
+            raise XendError('Could not read console MFN')
 
-        lock = True;
-        try:
-            XendDomain.instance().domains_lock.release()
-        except:
-            lock = False;
-
-        try:
-            dominfo.waitForDevices() # Wait for backends to set up
-        finally:
-            if lock:
-                XendDomain.instance().domains_lock.acquire()
-
-        if not paused:
-            dominfo.unpause()
+        restore_handler.resume(True, paused, None)
 
         return dominfo
     except Exception, exn:
@@ -358,23 +339,106 @@ def restore(xd, fd, dominfo = None, paused = False, relocating = False):
         raise exn
 
 
+class RestoreHandler:
+    def __init__(self, fd, colo, dominfo, inputHandler):
+        self.fd = fd
+        self.colo = colo
+        self.firsttime = True
+        self.inputHandler = inputHandler
+        self.dominfo = dominfo
+
+    def resume(self, finish, paused, child):
+        fd = self.fd
+        dominfo = self.dominfo
+        handler = self.inputHandler
+        restore_image.setCpuid()
+        dominfo.completeRestore(handler.store_mfn, handler.console_mfn)
+
+        if self.colo and not finish:
+            # notify master that checkpoint finishes
+            write_exact(fd, "finish", "failed to write finish done")
+            buf = read_exact(fd, 6, "failed to read resume flag")
+            if buf != "resume":
+                return False
+
+        from xen.xend import XendDomain
+
+        if self.firsttime:
+            lock = True;
+            try:
+                XendDomain.instance().domains_lock.release()
+            except:
+                lock = False;
+
+            try:
+                dominfo.waitForDevices() # Wait for backends to set up
+            finally:
+                if lock:
+                    XendDomain.instance().domains_lock.acquire()
+            if not paused:
+                dominfo.unpause()
+        else:
+            # colo
+            xc.domain_resume(dominfo.domid, 0)
+            ResumeDomain(dominfo.domid)
+
+        if self.colo and not finish:
+            child.tochild.write("resume\n")
+            child.tochild.flush()
+            buf = child.fromchild.readline()
+            if buf != "resume\n":
+                return False
+            if self.firsttime:
+                util.runcmd("/etc/xen/scripts/HA_fw_runtime.sh slaver")
+            # notify master side VM resumed
+            write_exact(fd, "resume", "failed to write resume done")
+
+            # wait new checkpoint
+            buf = read_exact(fd, 8, "failed to read continue flag")
+            if buf != "continue":
+                return False
+
+            child.tochild.write("suspend\n")
+            child.tochild.flush()
+            buf = child.fromchild.readline()
+            if buf != "suspend\n":
+                return False
+
+            # notify master side suspend done.
+            write_exact(fd, "suspend", "failed to write suspend done")
+            buf = read_exact(fd, 5, "failed to read start flag")
+            if buf != "start":
+                return False
+
+            child.tochild.write("start\n")
+            child.tochild.flush()
+
+        self.firsttime = False
+
+
 class RestoreInputHandler:
-    def __init__(self):
+    def __init__(self, colo):
+        self.colo = colo
         self.store_mfn = None
         self.console_mfn = None
 
-    def handler(self, line, _):
+    def handler(self, line, child, restorehandler):
+        if line == "finish":
+            # colo
+            return restorehandler.resume(False, False, child)
+
         m = re.match(r"^(store-mfn) (\d+)$", line)
         if m:
             self.store_mfn = int(m.group(2))
-        else:
-            m = re.match(r"^(console-mfn) (\d+)$", line)
-            if m:
-                self.console_mfn = int(m.group(2))
+            return True
+
+        m = re.match(r"^(console-mfn) (\d+)$", line)
+        if m:
+            self.console_mfn = int(m.group(2))
+            return True
+
+        return False
 
 
-def forkHelper(cmd, fd, inputHandler, closeToChild):
+def forkHelper(cmd, fd, inputHandler, closeToChild, restorehandler):
     child = xPopen3(cmd, True, -1, [fd])
 
     if closeToChild:
@@ -392,7 +456,7 @@ def forkHelper(cmd, fd, inputHandler, closeToChild):
         else:
             line = line.rstrip()
             log.debug('%s', line)
-            inputHandler(line, child.tochild)
+            inputHandler(line, child, restorehandler)
 
     except IOError, exn:
         raise XendError('Error reading from child process for %s: %s' %
-- 
1.8.0
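The slaver side of the handshake is driven by RestoreHandler.resume()
above. As a rough sketch of one steady-state round (after the initial
restore has completed), with conn an assumed connection to the master,
child an assumed wrapper around the xc_restore helper, and resume_svm()
standing in for xc.domain_resume() plus ResumeDomain():

    # Sketch of one steady-state round on the slaver, mirroring
    # RestoreHandler.resume() in patch 6; all three parameters are
    # assumed stand-ins, not xend APIs.

    def slaver_round(conn, child, resume_svm):
        conn.write('finish')                 # checkpoint applied locally
        if conn.read(6) != 'resume':         # master: resume the SVM
            return False
        resume_svm()                         # xc.domain_resume + ResumeDomain
        child.write('resume\n')              # let xc_restore continue too
        if child.readline() != 'resume\n':
            return False
        conn.write('resume')                 # report: SVM is running

        if conn.read(8) != 'continue':       # master opens the next round
            return False
        child.write('suspend\n')             # suspend the SVM via xc_restore
        if child.readline() != 'suspend\n':
            return False
        conn.write('suspend')                # report: SVM suspended
        if conn.read(5) != 'start':          # master starts sending state
            return False
        child.write('start\n')               # xc_restore receives the new state
        return True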
Add a new option, --colo, to the remus command. The options -i, --no-net
and --timer are overridden when --colo is specified.

In colo mode, we write the new signature "GuestColoRestore". If the Xen
tools on the secondary machine do not support colo, they will reject
this signature and the remus command will fail.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/python/xen/remus/image.py |  7 +++++--
 tools/python/xen/remus/save.py  |  6 ++++--
 tools/remus/remus               |  8 +++++++-
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/tools/python/xen/remus/image.py b/tools/python/xen/remus/image.py
index b79d1e5..0927314 100644
--- a/tools/python/xen/remus/image.py
+++ b/tools/python/xen/remus/image.py
@@ -189,9 +189,12 @@ def parseheader(header):
     "parses a header sexpression"
     return vm.parsedominfo(vm.strtosxpr(header))
 
-def makeheader(dominfo):
+def makeheader(dominfo, colo):
     "create an image header from a VM dominfo sxpr"
-    items = [SIGNATURE]
+    if colo:
+        items = [COLO_SIGNATURE]
+    else:
+        items = [SIGNATURE]
     sxpr = vm.sxprtostr(dominfo)
     items.append(struct.pack('!i', len(sxpr)))
     items.append(sxpr)
diff --git a/tools/python/xen/remus/save.py b/tools/python/xen/remus/save.py
index 71517da..5157153 100644
--- a/tools/python/xen/remus/save.py
+++ b/tools/python/xen/remus/save.py
@@ -127,7 +127,7 @@ class Keepalive(object):
 class Saver(object):
     def __init__(self, domid, fd, suspendcb=None, resumecb=None,
-                 checkpointcb=None, interval=0):
+                 checkpointcb=None, interval=0, colo=False):
         """Create a Saver object for taking guest checkpoints.
           domid: name, number or UUID of a running domain
           fd: a stream to which checkpoint data will be written.
@@ -135,12 +135,14 @@ class Saver(object):
           resumecb: callback invoked before guest resumes
           checkpointcb: callback invoked when a checkpoint is complete. Return
                         True to take another checkpoint, or False to stop.
+          colo: use colo mode
         """
         self.fd = fd
         self.suspendcb = suspendcb
         self.resumecb = resumecb
         self.checkpointcb = checkpointcb
         self.interval = interval
+        self.colo = colo
 
         self.vm = vm.VM(domid)
 
@@ -149,7 +151,7 @@ class Saver(object):
     def start(self):
         vm.getshadowmem(self.vm)
 
-        hdr = image.makeheader(self.vm.dominfo)
+        hdr = image.makeheader(self.vm.dominfo, self.colo)
         self.fd.write(hdr)
         self.fd.flush()
 
diff --git a/tools/remus/remus b/tools/remus/remus
index 11d83e4..34c200f 100644
--- a/tools/remus/remus
+++ b/tools/remus/remus
@@ -37,6 +37,8 @@ class Cfg(object):
                           help='run without net buffering (benchmark option)')
         parser.add_option('', '--timer', dest='timer', action='store_true',
                           help='force pause at checkpoint interval (experimental)')
+        parser.add_option('', '--colo', dest='colo', action='store_true',
+                          help='use colo checkpointing (experimental)')
         self.parser = parser
 
     def usage(self):
@@ -53,6 +55,11 @@ class Cfg(object):
             self.netbuffer = False
         if opts.timer:
             self.timer = True
+        self.colo = bool(opts.colo)
+        if opts.colo:
+            self.interval = 0
+            self.netbuffer = False
+            self.timer = True
 
         if not args:
             raise CfgException('Missing domain')
@@ -181,7 +187,7 @@ def run(cfg):
     rc = 0
 
     checkpointer = save.Saver(cfg.domid, fd, postsuspend, preresume, commit,
                               interval)
-                              interval)
+                              interval, cfg.colo)
 
     try:
         checkpointer.start()
-- 
1.8.0
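For context, a colo-mode replication session driven through the extended
Saver interface would look roughly like the following. Only the Saver
arguments and the --colo semantics come from the patch; the callbacks
and the channel setup are placeholders.

    # Hypothetical driver showing the Saver interface extended by patch 7.
    # The callbacks below are trivial placeholders and the socket setup is
    # illustrative; only the Saver arguments come from the patch.
    import socket
    from xen.remus import save

    def postsuspend():   # placeholder: invoked after the PVM suspends
        return True

    def preresume():     # placeholder: invoked before the PVM resumes
        return True

    def commit():        # placeholder: return True to take another checkpoint
        return True

    def replicate(domid, host, port):
        sock = socket.create_connection((host, port))  # channel to the slaver
        fd = sock.makefile('wb')
        checkpointer = save.Saver(domid, fd, suspendcb=postsuspend,
                                  resumecb=preresume, checkpointcb=commit,
                                  interval=0,          # forced to 0 by --colo
                                  colo=True)           # emit GuestColoRestore
        checkpointer.start()

From the command line, the equivalent is roughly "remus --colo <domain>
<host>", with -i, --no-net and --timer overridden as described above.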
George Dunlap
2013-Apr-03 11:44 UTC
Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
On Wed, Apr 3, 2013 at 9:02 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
> Virtual machine (VM) replication is a well known technique for providing
> application-agnostic software-implemented hardware fault tolerance -
> "non-stop service". Currently, remus provides this function, but it buffers
> all output packets, and the latency is unacceptable.

Just FYI, as we're in a feature freeze this can't be accepted until
the 4.3 release sometime in June; and since in the meantime we'll be
trying to get other features sorted and bugs fixed, you may not get
much review time until then.

 -George
Hi,

At 16:02 +0800 on 03 Apr (1365004959), Wen Congyang wrote:
> +    /* reset memory */
> +    hypercall.op = __HYPERVISOR_reset_memory_op;
> +    hypercall.arg[0] = (unsigned long)dom;
> +    do_xen_hypercall(xch, &hypercall);

You've added a new hypercall here but I don't see any implementation
(or documentation). Are there some xen-side patches missing?

Cheers,

Tim.

> @@ -93,6 +93,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
>  #define __HYPERVISOR_domctl               36
>  #define __HYPERVISOR_kexec_op             37
>  #define __HYPERVISOR_tmem_op              38
> +#define __HYPERVISOR_reset_memory_op      40
>
>  /* Architecture-specific hypercall definitions. */
>  #define __HYPERVISOR_arch_0               48
> --
> 1.8.0
Shriram Rajagopalan
2013-Apr-05 03:55 UTC
Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
FYI, it would be nice to cc the maintainer when you submit such a major
functionality change, especially for the xend code base that has so far
been the only stable, workable remus solution.

On Wed, Apr 3, 2013 at 6:44 AM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Apr 3, 2013 at 9:02 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
> > Virtual machine (VM) replication is a well known technique for providing
> > application-agnostic software-implemented hardware fault tolerance -
> > "non-stop service". Currently, remus provides this function, but it
> > buffers all output packets, and the latency is unacceptable.
>
> Just FYI, as we're in a feature freeze this can't be accepted until
> the 4.3 release sometime in June; and since in the meantime we'll be
> trying to get other features sorted and bugs fixed, you may not get
> much review time until then.
>
> -George
Shriram Rajagopalan
2013-Apr-05 05:06 UTC
Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
On Wed, Apr 3, 2013 at 3:02 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
> Virtual machine (VM) replication is a well known technique for providing
> application-agnostic software-implemented hardware fault tolerance -
> "non-stop service". Currently, remus provides this function, but it buffers
> all output packets, and the latency is unacceptable.
>
> In xen summit 2012, We introduce a new VM replication solution: colo
> (COarse-grain LOck-stepping virtual machine). The presentation is in
> the following URL:
> http://www.slideshare.net/xen_com_mgr/colo-coarsegrain-lockstepping-virtual-machines-for-nonstop-service
>
> Here is the summary of the solution:
> From the client's point of view, as long as the client observes identical
> responses from the primary and secondary VMs, according to the service
> semantics, then the secondary VM(SVM) is a valid replica of the primary
> VM(PVM), and can successfully take over when a hardware failure of the
> PVM is detected.
>
> This patchset is RFC, and implements the frame of colo:
> 1. Both PVM and SVM are running
> 2. Forward the input packets from client to secondary machine(slaver)
> 3. Forward the output packets from SVM to primary machine(master)
> 4. Compare the output packets from PVM and SVM on the master side. If the
>    output packets are different, do a checkpoint

I skimmed through the presentation. Interesting approach. It would be
nice to have the performance report mentioned in the slides available,
so that we can understand the exact setups for the benchmarks.

A few quick thoughts after looking at the presentation:

0. I am not completely sold on the type of applications you have used
to benchmark the system. They seem stateless and don't have much memory
churn (dirty pages/epoch). It would be nice to benchmark your system
against something more realistic, like the DVDStore benchmark or
percona-tools' TPCC benchmark with MySQL. [The clients have to be
outside the system, mind you.] And finally, something like Specweb2005,
where there are about 1000 dirty pages per 25ms epoch. I care more
about how many concurrent connections the server handled and how
frequently you had to synchronize between the machines.

1. The checkpoints are going to be very costly. If you are doing
coarse-grained locking and assuming that checkpoints are triggered
every second, you would probably lose all benefits of checkpoint
compression. Also, your working set would have grown considerably. You
will inevitably end up taking the slow path, where you suspend the VM,
"synchronously" send a ton of pages over the network (on the order of
10-100s of megabytes), and then resume the VM. Replicating this
checkpoint is going to take a long time and will screw performance. The
usual fast path uses a small buffer (16/32 MB): copy the dirty pages
into the buffer, then transmit them asynchronously to the backup.

2. What's the story for the DISK? The slides show that the VMs share a
SAN disk. And if both primary and secondary are operational, whose
packets are you going to discard, in a programmatic manner? While you
have an FTP server benchmark, it doesn't demonstrate output
consistency. I would suggest you run something like DVDStore (from
Dell) or some simple MySQL TPCC and see if the clients raise a hue and
cry about data corruption. ;)

3. What happens if one is running faster than the other? Let's say the
application does a bunch of dependent reads/writes to/from the SAN,
where each write depends on the output of the previous read.
And the writes are non-deterministic (i.e. they differ between primary
and secondary). Won't this system end up in perpetual synchronization,
since the outputs from primary and backup would be different, causing a
checkpoint again and again?

And I would like to see at least some *informal* guarantees of data
consistency - it might sound academic, but when you are talking about
putting critical customer applications like a MySQL database, a SAP
server or an e-commerce web app on this, "consistency" matters! It
helps to convince people that this system is not some half-baked
experiment but something that is well thought out.

Once again, please CC me on the patches. Several files you have touched
belong to the remus code, and the MAINTAINERS file has the maintainer
info.

Nit: in one of your slides, you mentioned 75 ms/checkpoint, of which
2/3rds was spent in suspend/resume. That isn't an artifact of Remus,
FYI. I have run remus at a 20ms checkpoint interval, where VMs were
suspended, checkpointed and resumed in under 2ms. With the addition of
a ton of functionality -- both at the toolstack and in the guest kernel
-- the suspend/resume times have gone up considerably. If you want to
reduce that overhead, try a SuSE-based kernel that has suspend-event
channel support. You may not need any of those lazy netifs/netups etc.
Even with that, the new power management framework in the 3.* kernels
seems to have made suspend/resume pretty slow.

thanks
shriram
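To make the fast path in point 1 above concrete: dirty pages are copied
into a bounded staging buffer while the guest is paused and are
transmitted only after the guest has resumed, so pause time stays small
regardless of network latency. A rough sketch, with every name (guest,
channel, BUF_LIMIT) illustrative rather than a Xen API:

    # Rough sketch of the Remus-style "fast path": stage dirty pages in
    # memory while the guest is paused, resume, then transmit. Falls back
    # to a synchronous "slow path" when the working set exceeds the buffer.
    BUF_LIMIT = 32 << 20   # e.g. a 32 MB staging buffer

    def checkpoint_round(guest, channel):
        guest.suspend()
        staged, size = [], 0
        overflow = False
        for pfn, page in guest.dirty_pages():   # copy while paused, don't send
            staged.append((pfn, page))
            size += len(page)
            if size > BUF_LIMIT:                # working set exceeds the buffer
                overflow = True
                break
        if overflow:
            # slow path: stream everything synchronously, guest stays paused
            for pfn, page in staged:
                channel.send(pfn, page)
            for pfn, page in guest.dirty_pages_remaining():  # assumed helper
                channel.send(pfn, page)
            guest.resume()
            return
        guest.resume()                          # guest runs again immediately
        for pfn, page in staged:                # transmit off the critical path
            channel.send(pfn, page)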
Ian Campbell
2013-Apr-11 13:55 UTC
Re: [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
On Fri, 2013-04-05 at 04:55 +0100, Shriram Rajagopalan wrote:
> the xend code base that has so far been the only stable, workable
> remus solution.

What is the status of xl remus at the minute? Is it being actively
worked on?

Ian