Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 00/16] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
Virtual machine (VM) replication is a well-known technique for providing application-agnostic, software-implemented hardware fault tolerance - "non-stop service". Currently, Remus provides this function, but it buffers all output packets, and the resulting latency is unacceptable. At Xen Summit 2012, we introduced a new VM replication solution: COLO (COarse-grain LOck-stepping virtual machines). The presentation is available at the following URL:

http://www.slideshare.net/xen_com_mgr/colo-coarsegrain-lockstepping-virtual-machines-for-nonstop-service

Here is a summary of the solution: from the client's point of view, as long as the client observes identical responses from the primary and secondary VMs, according to the service semantics, the secondary VM (SVM) is a valid replica of the primary VM (PVM), and can successfully take over when a hardware failure of the PVM is detected.

This patchset is an RFC, and implements the framework of COLO:
1. Both the PVM and the SVM are running.
2. A checkpoint is taken only when the output packets from the PVM and the SVM differ.
3. Write requests from the SVM are cached.

ChangeLog from v1 to v2:
1. Update block-remus to support COLO.
2. Split large patches into smaller ones.
3. Fix some bugs.
4. Add a new hypercall for COLO.

Changelog:
Patch 1: optimize the dirty-page transfer speed.
Patch 2-3: allow the SVM to keep running after a checkpoint.
Patch 4-5: modifications for COLO on the master side (wait for a new checkpoint, and communicate with the slave side when taking a checkpoint).
Patch 6-7: implement COLO's user interface.

Wen Congyang (16):
  xen: introduce new hypercall to reset vcpu
  block-remus: introduce colo mode
  block-remus: introduce an interface to allow the user to specify which
    mode the backup end uses
  dominfo.completeRestore() will be called more than once in colo mode
  xc_domain_restore: introduce restore_callbacks for colo
  colo: implement restore_callbacks init()/free()
  colo: implement restore_callbacks get_page()
  colo: implement restore_callbacks flush_memory
  colo: implement restore_callbacks update_p2m()
  colo: implement restore_callbacks finish_restore()
  xc_restore: implement for colo
  XendCheckpoint: implement colo
  xc_domain_save: flush cache before calling callbacks->postcopy()
  add callback to configure network for colo
  xc_domain_save: implement save_callbacks for colo
  remus: implement colo mode

 tools/blktap2/drivers/block-remus.c               |  188 ++++-
 tools/libxc/Makefile                              |    8 +-
 tools/libxc/xc_domain_restore.c                   |  264 ++++--
 tools/libxc/xc_domain_restore_colo.c              |  939 +++++++++++++++++++++
 tools/libxc/xc_domain_save.c                      |   23 +-
 tools/libxc/xc_save_restore_colo.h                |   14 +
 tools/libxc/xenguest.h                            |   51 ++
 tools/libxl/Makefile                              |    2 +-
 tools/python/xen/lowlevel/checkpoint/checkpoint.c |  322 +++++++-
 tools/python/xen/lowlevel/checkpoint/checkpoint.h |    1 +
 tools/python/xen/remus/device.py                  |    8 +
 tools/python/xen/remus/image.py                   |    8 +-
 tools/python/xen/remus/save.py                    |   13 +-
 tools/python/xen/xend/XendCheckpoint.py           |  127 ++-
 tools/python/xen/xend/XendDomainInfo.py           |   13 +-
 tools/remus/remus                                 |   28 +-
 tools/xcutils/Makefile                            |    4 +-
 tools/xcutils/xc_restore.c                        |   36 +-
 xen/arch/x86/domain.c                             |   57 ++
 xen/arch/x86/x86_64/entry.S                       |    4 +
 xen/include/public/xen.h                          |    1 +
 21 files changed, 1947 insertions(+), 164 deletions(-)
 create mode 100644 tools/libxc/xc_domain_restore_colo.c
 create mode 100644 tools/libxc/xc_save_restore_colo.h

-- 
1.7.4
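The checkpointing policy described in the cover letter (both VMs run; output is released while the two VMs agree; a checkpoint is triggered only on divergence) can be sketched as follows. This is a toy model, not code from the patchset; `colo_run` and its arguments are hypothetical stand-ins for the packet-comparison module:

```python
def colo_run(pvm_outputs, svm_outputs):
    """Toy model of the COLO decision loop.

    Unlike Remus, which buffers all output until the next checkpoint,
    COLO releases a packet immediately while PVM and SVM produce
    identical output, and checkpoints only when they diverge.
    Returns (released_packets, number_of_checkpoints).
    """
    released = []
    checkpoints = 0
    for p_out, s_out in zip(pvm_outputs, svm_outputs):
        if p_out == s_out:
            # Identical responses: the SVM is a valid replica, release now.
            released.append(p_out)
        else:
            # Divergence: resynchronise the SVM from the PVM, then release
            # the primary's (authoritative) output.
            checkpoints += 1
            released.append(p_out)
    return released, checkpoints
```

This is why COLO's latency is bounded by the comparison, not by the checkpoint interval: in the common case (identical outputs) no buffering delay is added at all.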
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 01/16] xen: introduce new hypercall to reset vcpu
In colo mode, the SVM is running, and it will create pagetables, use the GDT, and so on. When we take a new checkpoint, we may need to roll back all of these operations. This new hypercall does that.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 xen/arch/x86/domain.c       |   57 +++++++++++++++++++++++++++++++++++++++++++
 xen/arch/x86/x86_64/entry.S |    4 +++
 xen/include/public/xen.h    |    1 +
 3 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 874742c..709f77f 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1930,6 +1930,63 @@ int domain_relinquish_resources(struct domain *d)
     return 0;
 }
 
+int do_reset_vcpu_op(unsigned long domid)
+{
+    struct vcpu *v;
+    struct domain *d;
+    int ret;
+
+    if ( domid == DOMID_SELF )
+        /* We can't destroy our own pagetables */
+        return -EINVAL;
+
+    if ( (d = rcu_lock_domain_by_id(domid)) == NULL )
+        return -EINVAL;
+
+    BUG_ON(!cpumask_empty(d->domain_dirty_cpumask));
+    domain_pause(d);
+
+    if ( d->arch.relmem == RELMEM_not_started )
+    {
+        for_each_vcpu ( d, v )
+        {
+            /* Drop the in-use references to page-table bases. */
+            ret = vcpu_destroy_pagetables(v);
+            if ( ret )
+                return ret;
+
+            unmap_vcpu_info(v);
+            v->is_initialised = 0;
+        }
+
+        if ( !is_hvm_domain(d) )
+        {
+            for_each_vcpu ( d, v )
+            {
+                /*
+                 * Relinquish GDT mappings. No need for explicit unmapping of
+                 * the LDT as it automatically gets squashed with the guest
+                 * mappings.
+                 */
+                destroy_gdt(v);
+            }
+
+            if ( d->arch.pv_domain.pirq_eoi_map != NULL )
+            {
+                unmap_domain_page_global(d->arch.pv_domain.pirq_eoi_map);
+                put_page_and_type(
+                    mfn_to_page(d->arch.pv_domain.pirq_eoi_map_mfn));
+                d->arch.pv_domain.pirq_eoi_map = NULL;
+                d->arch.pv_domain.auto_unmask = 0;
+            }
+        }
+    }
+
+    domain_unpause(d);
+    rcu_unlock_domain(d);
+
+    return 0;
+}
+
 void arch_dump_domain_info(struct domain *d)
 {
     paging_dump_domain_info(d);
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 5beeccb..0e4dde4 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -762,6 +762,8 @@ ENTRY(hypercall_table)
         .quad do_domctl
         .quad do_kexec_op
         .quad do_tmem_op
+        .quad do_ni_hypercall       /* reserved for XenClient */
+        .quad do_reset_vcpu_op      /* 40 */
         .rept __HYPERVISOR_arch_0-((.-hypercall_table)/8)
         .quad do_ni_hypercall
         .endr
@@ -810,6 +812,8 @@ ENTRY(hypercall_args_table)
         .byte 1 /* do_domctl            */
         .byte 2 /* do_kexec             */
         .byte 1 /* do_tmem_op           */
+        .byte 0 /* do_ni_hypercall      */
+        .byte 1 /* do_reset_vcpu_op     */  /* 40 */
         .rept __HYPERVISOR_arch_0-(.-hypercall_args_table)
         .byte 0 /* do_ni_hypercall      */
         .endr
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index 3cab74f..696f4a3 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -101,6 +101,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_ulong_t);
 #define __HYPERVISOR_kexec_op             37
 #define __HYPERVISOR_tmem_op              38
 #define __HYPERVISOR_xc_reserved_op       39 /* reserved for XenClient */
+#define __HYPERVISOR_reset_vcpu_op        40
 
 /* Architecture-specific hypercall definitions. */
 #define __HYPERVISOR_arch_0               48
-- 
1.7.4
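The hypercall-table arithmetic in the entry.S hunk above can be modelled to check the slot layout: the new op takes index 40 (slot 39 stays `do_ni_hypercall`, reserved for XenClient), and the `.rept` directive pads every remaining slot up to `__HYPERVISOR_arch_0` (48) with `do_ni_hypercall`. A small sketch of that layout (purely illustrative, not Xen code):

```python
HYPERVISOR_XC_RESERVED_OP = 39   # reserved for XenClient
HYPERVISOR_RESET_VCPU_OP = 40    # the new hypercall number
HYPERVISOR_ARCH_0 = 48           # first architecture-specific slot

def build_table(entries, arch_0=HYPERVISOR_ARCH_0, ni="do_ni_hypercall"):
    """Model the entry.S layout: explicit entries first, then pad the
    remaining slots up to __HYPERVISOR_arch_0 with do_ni_hypercall,
    as the .rept directive does."""
    table = dict(entries)
    for n in range(arch_0):
        table.setdefault(n, ni)
    return table
```

The point of the padding is that adding an entry at 40 must not shift `__HYPERVISOR_arch_0`: the `.rept` count shrinks automatically as explicit entries are added.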
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 02/16] block-remus: introduce colo mode

In colo mode, the SVM is running, so we cannot use mode_backup for colo. Introduce a new mode, mode_colo, to handle it:

write: cache all write requests in ramdisk.local.
read: first try to read the sector from the ramdisk; if the SVM has not modified this sector, read it from the disk file.
flush: when taking a checkpoint, drop all cached write requests and flush the requests from the master into the disk file.

The PVM uses mode_primary.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/blktap2/drivers/block-remus.c |  139 ++++++++++++++++++++++++++++++++++-
 1 files changed, 137 insertions(+), 2 deletions(-)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index 079588d..bced0e9 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -57,6 +57,7 @@
 #include <sys/sysctl.h>
 #include <unistd.h>
 #include <sys/stat.h>
+#include <stdbool.h>
 
 /* timeout for reads and writes in ms */
 #define HEARTBEAT_MS 1000
@@ -71,7 +72,8 @@ enum tdremus_mode {
         mode_invalid = 0,
         mode_unprotected,
         mode_primary,
-        mode_backup
+        mode_backup,
+        mode_colo
 };
 
 struct tdremus_req {
@@ -121,6 +123,14 @@ struct ramdisk {
          * */
         struct hashtable* inprogress;
+
+        /* local holds the requests from the backup vm.
+         * If we flush the requests held in h, we will drop all requests in
+         * local.
+         * If we switch to unprotected mode, all requests in local should be
+         * flushed to disk.
+         */
+        struct hashtable* local;
 };
 
 /* the ramdisk intercepts the original callback for reads and writes.
@@ -1195,6 +1205,126 @@ static int backup_start(td_driver_t *driver)
         return 0;
 }
 
+static int ramdisk_read_colo(struct ramdisk* ramdisk, uint64_t sector,
+                             int nb_sectors, char* buf)
+{
+        int i;
+        char* v;
+        uint64_t key;
+
+        for (i = 0; i < nb_sectors; i++) {
+                key = sector + i;
+                /* check whether it is queued in a previous flush request */
+                if (!(v = hashtable_search(ramdisk->local, &key)))
+                        return -1;
+                memcpy(buf + i * ramdisk->sector_size, v, ramdisk->sector_size);
+        }
+
+        return 0;
+}
+
+static void colo_queue_read(td_driver_t *driver, td_request_t treq)
+{
+        struct tdremus_state *s = (struct tdremus_state *)driver->data;
+        int i;
+        if(!remus_image)
+                remus_image = treq.image;
+
+        /* check if this read is queued in any currently ongoing flush */
+        if (ramdisk_read_colo(&s->ramdisk, treq.sec, treq.secs, treq.buf)) {
+                /* TODO: Add to pending read hash */
+                td_forward_request(treq);
+        } else {
+                /* complete the request */
+                td_complete_request(treq, 0);
+        }
+}
+
+static inline int ramdisk_write_colo(struct ramdisk* ramdisk, uint64_t sector,
+                                     int nb_sectors, char* buf)
+{
+        int i, rc;
+
+        for (i = 0; i < nb_sectors; i++) {
+                rc = ramdisk_write_hash(ramdisk->local, sector + i,
+                                        buf + i * ramdisk->sector_size,
+                                        ramdisk->sector_size);
+                if (rc)
+                        return rc;
+        }
+
+        return 0;
+}
+
+static void colo_queue_write(td_driver_t *driver, td_request_t treq)
+{
+        struct tdremus_state *s = (struct tdremus_state *)driver->data;
+
+        if (ramdisk_write_colo(&s->ramdisk, treq.sec, treq.secs, treq.buf) < 0)
+                td_complete_request(treq, -EBUSY);
+        else
+                td_complete_request(treq, 0);
+}
+
+/* flush_local:
+ *   true: we have switched to unprotected mode, so all queued requests in h
+ *         should be dropped.
+ *   false: all queued requests in local should be dropped, and all queued
+ *          requests in h should be flushed.
+ */
+static int ramdisk_start_flush_colo(td_driver_t *driver, bool flush_local)
+{
+        struct tdremus_state *s = (struct tdremus_state *)driver->data;
+
+        if (flush_local) {
+                if (s->ramdisk.h) {
+                        hashtable_destroy(s->ramdisk.h, 1);
+                        s->ramdisk.h = NULL;
+                }
+                if (s->ramdisk.local) {
+                        s->ramdisk.h = s->ramdisk.local;
+                        s->ramdisk.local = NULL;
+                }
+        } else if (s->ramdisk.local){
+                hashtable_destroy(s->ramdisk.local, 1);
+                s->ramdisk.local = create_hashtable(RAMDISK_HASHSIZE,
+                                                    uint64_hash,
+                                                    rd_hash_equal);
+        }
+
+        return ramdisk_start_flush(driver);
+}
+
+/* This function will be called when we switch to unprotected mode. In this
+ * case, we should flush the queued requests in prev and local.
+ */
+static int colo_flush(td_driver_t *driver)
+{
+        struct tdremus_state *s = (struct tdremus_state *)driver->data;
+
+        ramdisk_start_flush_colo(driver, 1);
+
+        /* all queued requests that should be flushed are in prev now, so we
+         * can use server_flush to do the flush.
+         */
+        s->queue_flush = server_flush;
+        return 0;
+}
+
+static int colo_start(td_driver_t *driver)
+{
+        struct tdremus_state *s = (struct tdremus_state *)driver->data;
+
+        /* colo mode is switched from backup mode */
+        s->ramdisk.local = create_hashtable(RAMDISK_HASHSIZE, uint64_hash,
+                                            rd_hash_equal);
+        tapdisk_remus.td_queue_read = colo_queue_read;
+        tapdisk_remus.td_queue_write = colo_queue_write;
+        s->queue_flush = colo_flush;
+        return 0;
+}
+
 static int server_do_wreq(td_driver_t *driver)
 {
         struct tdremus_state *s = (struct tdremus_state *)driver->data;
@@ -1255,7 +1385,10 @@ static int server_do_creq(td_driver_t *driver)
 
 //      RPRINTF("committing buffer\n");
 
-        ramdisk_start_flush(driver);
+        if (s->mode == mode_colo)
+                ramdisk_start_flush_colo(driver, 0);
+        else
+                ramdisk_start_flush(driver);
 
         /* XXX this message should not be sent until flush completes!
         */
        if (write(s->stream_fd.fd, TDREMUS_DONE, strlen(TDREMUS_DONE)) != 4)
@@ -1470,6 +1603,8 @@ static int switch_mode(td_driver_t *driver, enum tdremus_mode mode)
                rc = primary_start(driver);
        else if (mode == mode_backup)
                rc = backup_start(driver);
+       else if (mode == mode_colo)
+               rc = colo_start(driver);
        else {
                RPRINTF("unknown mode requested: %d\n", mode);
                rc = -1;
-- 
1.7.4
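The three-way interaction between the disk file, the replicated-write hashtable `h`, and the new SVM-local hashtable `local` can be summarised with a toy model (Python dicts standing in for the driver's hashtables and disk file; this illustrates the semantics only, not the tapdisk code):

```python
class ColoRamdiskModel:
    """Toy model of mode_colo semantics in block-remus.

    disk  : the backing disk file
    h     : writes replicated from the primary (flushed at checkpoints)
    local : writes issued by the running SVM (dropped at checkpoints)
    """
    def __init__(self, disk):
        self.disk = disk
        self.h = {}
        self.local = {}

    def write(self, sector, data):
        # colo_queue_write: SVM writes go only to the local cache.
        self.local[sector] = data

    def read(self, sector):
        # colo_queue_read: try the local cache first; if the SVM has not
        # modified this sector, fall back to the disk file.
        return self.local.get(sector, self.disk.get(sector))

    def checkpoint_flush(self):
        # ramdisk_start_flush_colo(flush_local=False): drop the SVM's
        # speculative writes and commit the primary's writes to disk.
        self.local.clear()
        self.disk.update(self.h)
        self.h.clear()
```

This is the core invariant of the mode: the SVM's writes are always discardable, so rolling the SVM back at a checkpoint never requires undoing disk I/O.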
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 03/16] block-remus: introduce an interface to allow the user to specify which mode the backup end uses
block-remus can be used for both remus and colo, so we need a way to tell block-remus which mode it should use: write the mode to /var/run/tap/remus_xxx (the control file):
1. 'r': remus
2. 'c': colo

The mode must be written to the control file before any other command. The master side writes TDREMUS_COLO to the slave side to tell it to use colo mode.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/blktap2/drivers/block-remus.c |   49 +++++++++++++++++++++++++++++++++++
 tools/python/xen/remus/device.py    |    8 +++++
 tools/remus/remus                   |    1 +
 3 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index bced0e9..a85f5a0 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -188,6 +188,9 @@ struct tdremus_state {
         /* mode methods */
         enum tdremus_mode mode;
         int (*queue_flush)(td_driver_t *driver);
+
+        /* init data */
+        int init_state; /* 0: init, 1: remus, 2: colo */
 };
 
 typedef struct tdremus_wire {
@@ -201,6 +204,7 @@ typedef struct tdremus_wire {
 #define TDREMUS_WRITE "wreq"
 #define TDREMUS_SUBMIT "sreq"
 #define TDREMUS_COMMIT "creq"
+#define TDREMUS_COLO "colo"
 #define TDREMUS_DONE "done"
 #define TDREMUS_FAIL "fail"
 
@@ -786,6 +790,33 @@ static int primary_do_connect(struct tdremus_state *state)
         return 0;
 }
 
+static void read_state(struct tdremus_state *s)
+{
+        int rc;
+        char state;
+
+        rc = read(s->ctl_fd.fd, &state, 1);
+        if (rc <= 0)
+                return;
+
+        if (state == 'r') {
+                s->init_state = 1;
+        } else if (state == 'c') {
+                s->init_state = 2;
+        } else {
+                RPRINTF("read unknown state: %d, use remus\n", (int)state);
+                s->init_state = 1;
+        }
+}
+
+static void start_remus(struct tdremus_state *s)
+{
+        if (mwrite(s->stream_fd.fd, TDREMUS_COLO, strlen(TDREMUS_COLO)) < 0) {
+                RPRINTF("error start colo mode");
+                exit(1);
+        }
+}
+
 static int primary_blocking_connect(struct tdremus_state *state)
 {
         int fd;
@@ -835,6 +866,18 @@ static int primary_blocking_connect(struct tdremus_state *state)
 
         state->stream_fd.fd = fd;
         state->stream_fd.id = id;
+
+        /* The user runs the remus command after we try to connect to the
+         * backup end */
+        if (!state->init_state)
+                read_state(state);
+
+        if (!state->init_state) {
+                RPRINTF("read state failed, try to use remus\n");
+                state->init_state = 1;
+        }
+
+        if (state->init_state == 2)
+                start_remus(state);
 
         return 0;
 }
@@ -1424,6 +1467,8 @@ static void remus_server_event(event_id_t id, char mode, void *private)
                 server_do_sreq(driver);
         else if (!strcmp(req, TDREMUS_COMMIT))
                 server_do_creq(driver);
+        else if (!strcmp(req, TDREMUS_COLO))
+                switch_mode(driver, mode_colo);
         else
                 RPRINTF("unknown request received: %s\n", req);
 
@@ -1624,6 +1669,10 @@ static void ctl_request(event_id_t id, char mode, void *private)
         int rc;
 
 //      RPRINTF("data waiting on control fifo\n");
+        if (!s->init_state) {
+                read_state(s);
+                return;
+        }
 
         if (!(rc = read(s->ctl_fd.fd, msg, sizeof(msg) - 1 /* append nul */))) {
                 RPRINTF("0-byte read received, reopening FIFO\n");
diff --git a/tools/python/xen/remus/device.py b/tools/python/xen/remus/device.py
index 970e1ea..bbb1cd8 100644
--- a/tools/python/xen/remus/device.py
+++ b/tools/python/xen/remus/device.py
@@ -12,6 +12,10 @@ class BufferedNICException(Exception):
     pass
 
 class CheckpointedDevice(object):
     'Base class for buffered devices'
 
+    def init(self, mode):
+        'init device state, only called once'
+        pass
+
     def postsuspend(self):
         'called after guest has suspended'
         pass
@@ -79,6 +83,10 @@
     def __del__(self):
         self.uninstall()
 
+    def init(self, mode):
+        if self.ctlfd:
+            os.write(self.ctlfd.fileno(), mode)
+
     def uninstall(self):
         if self.ctlfd:
             self.ctlfd.close()
diff --git a/tools/remus/remus b/tools/remus/remus
index 38f0365..d5178cd 100644
--- a/tools/remus/remus
+++ b/tools/remus/remus
@@ -124,6 +124,7 @@ def run(cfg):
     for disk in dom.disks:
         try:
             bufs.append(ReplicatedDisk(disk))
+            disk.init('r')
         except ReplicatedDiskException, e:
             print e
             continue
-- 
1.7.4
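On the tool side, the whole negotiation is a single byte written to the control file before any other command, matching what `read_state()` above expects. A minimal sketch (the file-descriptor handling is illustrative; the real code writes via `ReplicatedDisk.init()`):

```python
import os

def announce_mode(ctlfd, colo=False):
    """Write the one-byte mode tag that block-remus's read_state()
    expects as the first thing on the control file:
    b'r' selects remus mode, b'c' selects colo mode."""
    os.write(ctlfd, b'c' if colo else b'r')
```

Because block-remus defaults to remus on an unknown or missing tag, old tools that never write a mode byte keep working unchanged.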
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 04/16] dominfo.completeRestore() will be called more than once in colo mode
The SVM is running in colo mode, so dominfo.completeRestore() will be called more than once. Some of the work in dominfo.completeRestore() should be done only once.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/python/xen/xend/XendDomainInfo.py |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/tools/python/xen/xend/XendDomainInfo.py b/tools/python/xen/xend/XendDomainInfo.py
index e9d3e7e..b5b2db9 100644
--- a/tools/python/xen/xend/XendDomainInfo.py
+++ b/tools/python/xen/xend/XendDomainInfo.py
@@ -3011,18 +3011,19 @@ class XendDomainInfo:
     # TODO: recategorise - called from XendCheckpoint
     #
 
-    def completeRestore(self, store_mfn, console_mfn):
+    def completeRestore(self, store_mfn, console_mfn, first_time = True):
 
         log.debug("XendDomainInfo.completeRestore")
 
         self.store_mfn = store_mfn
         self.console_mfn = console_mfn
 
-        self._introduceDomain()
-        self.image = image.create(self, self.info)
-        if self.image:
-            self.image.createDeviceModel(True)
-        self._storeDomDetails()
+        if first_time:
+            self._introduceDomain()
+            self.image = image.create(self, self.info)
+            if self.image:
+                self.image.createDeviceModel(True)
+            self._storeDomDetails()
         self._registerWatches()
         self.refreshShutdown()
-- 
1.7.4
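The split the patch makes can be modelled abstractly: the domain-introduction/device-model work runs only on the first call, while the watch registration and shutdown refresh run on every checkpoint. A toy model of that split (illustrative only, not xend code):

```python
class RestoreModel:
    """Toy model of completeRestore()'s one-time vs per-checkpoint work."""
    def __init__(self):
        self.one_time_calls = 0    # _introduceDomain, device model, dom details
        self.every_time_calls = 0  # _registerWatches, refreshShutdown

    def complete_restore(self, first_time=True):
        if first_time:
            self.one_time_calls += 1
        self.every_time_calls += 1
```

In colo mode the caller passes `first_time=False` on every checkpoint after the initial restore, so the one-time work is never repeated.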
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 05/16] xc_domain_restore: introduce restore_callbacks for colo
In colo mode, SVM also runs. So we should update xc_restore to support it. The first step is: add some callbacks for colo. We add the following callbacks: 1. init(): init the private data used for colo 2. free(): free the resource we allocate and store in the private data 3. get_page(): SVM runs, so we can''t update the memory in apply_batch(). This callback will return a page buffer, and apply_batch() will copy the page to this buffer. The content of this buffer should be the current content of this page, so we can use it to do verify. 4. flush_memory(): update the SVM memory and pagetable. 5. update_p2m(): update the SVM p2m page. 6. finish_restore(): wait a new checkpoint. We also add a new structure restore_data to avoid pass too many arguments to these callbacks. This structure stores the variables used in xc_domain_store(), and these variables will be used in the callback. Signed-off-by: Ye Wei <wei.ye1987@gmail.com> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> --- tools/libxc/xc_domain_restore.c | 264 ++++++++++++++++++++++++++------------- tools/libxc/xenguest.h | 48 +++++++ 2 files changed, 225 insertions(+), 87 deletions(-) diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c index 63d36cd..aac2de0 100644 --- a/tools/libxc/xc_domain_restore.c +++ b/tools/libxc/xc_domain_restore.c @@ -1076,7 +1076,8 @@ static int pagebuf_get(xc_interface *xch, struct restore_ctx *ctx, static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx, xen_pfn_t* region_mfn, unsigned long* pfn_type, int pae_extended_cr3, struct xc_mmu* mmu, - pagebuf_t* pagebuf, int curbatch) + pagebuf_t* pagebuf, int curbatch, + struct restore_callbacks *callbacks) { int i, j, curpage, nr_mfns; int k, scount; @@ -1085,6 +1086,7 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx, unsigned long buf[PAGE_SIZE/sizeof(unsigned long)]; /* Our mapping of the current 
region (batch) */ char *region_base; + char *target_buf; /* A temporary mapping, and a copy, of one frame of guest memory. */ unsigned long *page = NULL; int nraces = 0; @@ -1241,21 +1243,24 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx, region_mfn[i] = ctx->hvm ? pfn : ctx->p2m[pfn]; } - /* Map relevant mfns */ - pfn_err = calloc(j, sizeof(*pfn_err)); - if ( pfn_err == NULL ) + if ( !callbacks || !callbacks->get_page) { - PERROR("allocation for pfn_err failed"); - return -1; - } - region_base = xc_map_foreign_bulk( - xch, dom, PROT_WRITE, region_mfn, pfn_err, j); + /* Map relevant mfns */ + pfn_err = calloc(j, sizeof(*pfn_err)); + if ( pfn_err == NULL ) + { + PERROR("allocation for pfn_err failed"); + return -1; + } + region_base = xc_map_foreign_bulk( + xch, dom, PROT_WRITE, region_mfn, pfn_err, j); - if ( region_base == NULL ) - { - PERROR("map batch failed"); - free(pfn_err); - return -1; + if ( region_base == NULL ) + { + PERROR("map batch failed"); + free(pfn_err); + return -1; + } } for ( i = 0, curpage = -1; i < j; i++ ) @@ -1279,7 +1284,7 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx, continue; } - if (pfn_err[i]) + if ( (!callbacks || !callbacks->get_page) && pfn_err[i] ) { ERROR("unexpected PFN mapping failure pfn %lx map_mfn %lx p2m_mfn %lx", pfn, region_mfn[i], ctx->p2m[pfn]); @@ -1298,8 +1303,20 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx, mfn = ctx->p2m[pfn]; + if ( callbacks && callbacks->get_page ) + { + target_buf = callbacks->get_page(&callbacks->comm_data, + callbacks->data, pfn); + if ( !target_buf ) + { + ERROR("Cannot get a buffer to store memory"); + goto err_mapped; + } + } + else + target_buf = region_base + i*PAGE_SIZE; /* In verify mode, we use a copy; otherwise we work in place */ - page = pagebuf->verify ? (void *)buf : (region_base + i*PAGE_SIZE); + page = pagebuf->verify ? 
(void *)buf : target_buf; /* Remus - page decompression */ if (pagebuf->compressing) @@ -1357,27 +1374,26 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx, if ( pagebuf->verify ) { - int res = memcmp(buf, (region_base + i*PAGE_SIZE), PAGE_SIZE); + int res = memcmp(buf, target_buf, PAGE_SIZE); if ( res ) { int v; DPRINTF("************** pfn=%lx type=%lx gotcs=%08lx " "actualcs=%08lx\n", pfn, pagebuf->pfn_types[pfn], - csum_page(region_base + (i + curbatch)*PAGE_SIZE), + csum_page(target_buf), csum_page(buf)); for ( v = 0; v < 4; v++ ) { - unsigned long *p = (unsigned long *) - (region_base + i*PAGE_SIZE); + unsigned long *p = (unsigned long *)target_buf; if ( buf[v] != p[v] ) DPRINTF(" %d: %08lx %08lx\n", v, buf[v], p[v]); } } } - if ( !ctx->hvm && + if ( (!callbacks || !callbacks->get_page) && !ctx->hvm && xc_add_mmu_update(xch, mmu, (((unsigned long long)mfn) << PAGE_SHIFT) | MMU_MACHPHYS_UPDATE, pfn) ) @@ -1390,8 +1406,11 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx, rc = nraces; err_mapped: - munmap(region_base, j*PAGE_SIZE); - free(pfn_err); + if ( !callbacks || !callbacks->get_page ) + { + munmap(region_base, j*PAGE_SIZE); + free(pfn_err); + } return rc; } @@ -1461,6 +1480,9 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom, struct restore_ctx *ctx = &_ctx; struct domain_info_context *dinfo = &ctx->dinfo; + struct restore_data *comm_data = NULL; + void *data = NULL; + DPRINTF("%s: starting restore of new domid %u", __func__, dom); pagebuf_init(&pagebuf); @@ -1582,6 +1604,33 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom, goto out; } + /* init callbacks->comm_data */ + if ( callbacks ) + { + callbacks->comm_data.xch = xch; + callbacks->comm_data.dom = dom; + callbacks->comm_data.dinfo = dinfo; + callbacks->comm_data.io_fd = io_fd; + callbacks->comm_data.hvm = hvm; + callbacks->comm_data.pfn_type = pfn_type; + callbacks->comm_data.mmu = mmu; + 
callbacks->comm_data.p2m_frame_list = p2m_frame_list; + callbacks->comm_data.p2m = ctx->p2m; + comm_data = &callbacks->comm_data; + + /* init callbacks->data */ + if ( callbacks->init) + { + callbacks->data = NULL; + if (callbacks->init(&callbacks->comm_data, &callbacks->data) < 0 ) + { + ERROR("Could not initialise restore callbacks private data"); + goto out; + } + } + data = callbacks->data; + } + xc_report_progress_start(xch, "Reloading memory pages", dinfo->p2m_size); /* @@ -1676,7 +1725,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom, int brc; brc = apply_batch(xch, dom, ctx, region_mfn, pfn_type, - pae_extended_cr3, mmu, &pagebuf, curbatch); + pae_extended_cr3, mmu, &pagebuf, curbatch, + callbacks); if ( brc < 0 ) goto out; @@ -1761,6 +1811,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom, goto finish; } +getpages: // DPRINTF("Buffered checkpoint\n"); if ( pagebuf_get(xch, ctx, &pagebuf, io_fd, dom) ) { @@ -1902,58 +1953,69 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom, } } - /* - * Pin page tables. Do this after writing to them as otherwise Xen - * will barf when doing the type-checking. - */ - nr_pins = 0; - for ( i = 0; i < dinfo->p2m_size; i++ ) + if ( callbacks && callbacks->flush_memory ) { - if ( (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 ) - continue; - - switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) + if ( callbacks->flush_memory(comm_data, data) < 0 ) { - case XEN_DOMCTL_PFINFO_L1TAB: - pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE; - break; + ERROR("Error doing callbacks->flush_memory()"); + goto out; + } + } + else + { + /* + * Pin page tables. Do this after writing to them as otherwise Xen + * will barf when doing the type-checking. 
+ */ + nr_pins = 0; + for ( i = 0; i < dinfo->p2m_size; i++ ) + { + if ( (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 ) + continue; - case XEN_DOMCTL_PFINFO_L2TAB: - pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE; - break; + switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) + { + case XEN_DOMCTL_PFINFO_L1TAB: + pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE; + break; - case XEN_DOMCTL_PFINFO_L3TAB: - pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE; - break; + case XEN_DOMCTL_PFINFO_L2TAB: + pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE; + break; - case XEN_DOMCTL_PFINFO_L4TAB: - pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE; - break; + case XEN_DOMCTL_PFINFO_L3TAB: + pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE; + break; - default: - continue; - } + case XEN_DOMCTL_PFINFO_L4TAB: + pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE; + break; - pin[nr_pins].arg1.mfn = ctx->p2m[i]; - nr_pins++; + default: + continue; + } - /* Batch full? Then flush. */ - if ( nr_pins == MAX_PIN_BATCH ) - { - if ( xc_mmuext_op(xch, pin, nr_pins, dom) < 0 ) + pin[nr_pins].arg1.mfn = ctx->p2m[i]; + nr_pins++; + + /* Batch full? Then flush. */ + if ( nr_pins == MAX_PIN_BATCH ) { - PERROR("Failed to pin batch of %d page tables", nr_pins); - goto out; + if ( xc_mmuext_op(xch, pin, nr_pins, dom) < 0 ) + { + PERROR("Failed to pin batch of %d page tables", nr_pins); + goto out; + } + nr_pins = 0; } - nr_pins = 0; } - } - /* Flush final partial batch. */ - if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0) ) - { - PERROR("Failed to pin batch of %d page tables", nr_pins); - goto out; + /* Flush final partial batch. 
*/ + if ( (nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0) ) + { + PERROR("Failed to pin batch of %d page tables", nr_pins); + goto out; + } } DPRINTF("Memory reloaded (%ld pages)\n", ctx->nr_pfns); @@ -2052,6 +2114,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom, *console_mfn = ctx->p2m[GET_FIELD(start_info, console.domU.mfn)]; SET_FIELD(start_info, console.domU.mfn, *console_mfn); SET_FIELD(start_info, console.domU.evtchn, console_evtchn); + callbacks->comm_data.store_mfn = *store_mfn; + callbacks->comm_data.console_mfn = *console_mfn; munmap(start_info, PAGE_SIZE); } /* Uncanonicalise each GDT frame number. */ @@ -2199,37 +2263,61 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom, /* leave wallclock time. set by hypervisor */ munmap(new_shared_info, PAGE_SIZE); - /* Uncanonicalise the pfn-to-mfn table frame-number list. */ - for ( i = 0; i < P2M_FL_ENTRIES; i++ ) + if ( callbacks && callbacks->update_p2m ) { - pfn = p2m_frame_list[i]; - if ( (pfn >= dinfo->p2m_size) || (pfn_type[pfn] != XEN_DOMCTL_PFINFO_NOTAB) ) + if ( callbacks->update_p2m(comm_data, data) < 0 ) { - ERROR("PFN-to-MFN frame number %i (%#lx) is bad", i, pfn); + ERROR("Error doing callbacks->update_p2m()"); goto out; } - p2m_frame_list[i] = ctx->p2m[pfn]; } - - /* Copy the P2M we''ve constructed to the ''live'' P2M */ - if ( !(ctx->live_p2m = xc_map_foreign_pages(xch, dom, PROT_WRITE, - p2m_frame_list, P2M_FL_ENTRIES)) ) + else { - PERROR("Couldn''t map p2m table"); - goto out; + /* Uncanonicalise the pfn-to-mfn table frame-number list. 
*/ + for ( i = 0; i < P2M_FL_ENTRIES; i++ ) + { + pfn = p2m_frame_list[i]; + if ( (pfn >= dinfo->p2m_size) || (pfn_type[pfn] != XEN_DOMCTL_PFINFO_NOTAB) ) + { + ERROR("PFN-to-MFN frame number %i (%#lx) is bad", i, pfn); + goto out; + } + p2m_frame_list[i] = ctx->p2m[pfn]; + } + + /* Copy the P2M we''ve constructed to the ''live'' P2M */ + if ( !(ctx->live_p2m = xc_map_foreign_pages(xch, dom, PROT_WRITE, + p2m_frame_list, P2M_FL_ENTRIES)) ) + { + PERROR("Couldn''t map p2m table"); + goto out; + } + + /* If the domain we''re restoring has a different word size to ours, + * we need to adjust the live_p2m assignment appropriately */ + if ( dinfo->guest_width > sizeof (xen_pfn_t) ) + for ( i = dinfo->p2m_size - 1; i >= 0; i-- ) + ((int64_t *)ctx->live_p2m)[i] = (long)ctx->p2m[i]; + else if ( dinfo->guest_width < sizeof (xen_pfn_t) ) + for ( i = 0; i < dinfo->p2m_size; i++ ) + ((uint32_t *)ctx->live_p2m)[i] = ctx->p2m[i]; + else + memcpy(ctx->live_p2m, ctx->p2m, dinfo->p2m_size * sizeof(xen_pfn_t)); + munmap(ctx->live_p2m, P2M_FL_ENTRIES * PAGE_SIZE); } - /* If the domain we''re restoring has a different word size to ours, - * we need to adjust the live_p2m assignment appropriately */ - if ( dinfo->guest_width > sizeof (xen_pfn_t) ) - for ( i = dinfo->p2m_size - 1; i >= 0; i-- ) - ((int64_t *)ctx->live_p2m)[i] = (long)ctx->p2m[i]; - else if ( dinfo->guest_width < sizeof (xen_pfn_t) ) - for ( i = 0; i < dinfo->p2m_size; i++ ) - ((uint32_t *)ctx->live_p2m)[i] = ctx->p2m[i]; - else - memcpy(ctx->live_p2m, ctx->p2m, dinfo->p2m_size * sizeof(xen_pfn_t)); - munmap(ctx->live_p2m, P2M_FL_ENTRIES * PAGE_SIZE); + if ( callbacks && callbacks->finish_restotre ) + { + rc = callbacks->finish_restotre(comm_data, data); + if ( rc == 1 ) + goto getpages; + + if ( rc < 0 ) + { + ERROR("Er1ror doing callbacks->finish_restotre()"); + goto out; + } + } rc = xc_dom_gnttab_seed(xch, dom, *console_mfn, *store_mfn, console_domid, store_domid); @@ -2329,6 +2417,8 @@ int 
xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom, rc = 0; out: + if ( callbacks && callbacks->free && callbacks->data) + callbacks->free(&callbacks->comm_data, callbacks->data); if ( (rc != 0) && (dom != 0) ) xc_domain_destroy(xch, dom); xc_hypercall_buffer_free(xch, ctxt); diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h index 4714bd2..4bb444a 100644 --- a/tools/libxc/xenguest.h +++ b/tools/libxc/xenguest.h @@ -90,12 +90,60 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter unsigned long vm_generationid_addr); +/* pass the variable defined in xc_domain_restore() to callback. Use + * this structure for the following purpose: + * 1. avoid too many arguments. + * 2. different callback implemention may need different arguments. + * Just add the information you need here. + */ +struct restore_data +{ + xc_interface *xch; + uint32_t dom; + struct domain_info_context *dinfo; + int io_fd; + int hvm; + unsigned long *pfn_type; + struct xc_mmu *mmu; + unsigned long *p2m_frame_list; + unsigned long *p2m; + unsigned long console_mfn; + unsigned long store_mfn; +}; + /* callbacks provided by xc_domain_restore */ struct restore_callbacks { + /* callback to init data */ + int (*init)(struct restore_data *comm_data, void **data); + /* callback to free data */ + void (*free)(struct restore_data *comm_data, void *data); + /* callback to get a buffer to store memory data that is transfered + * from the source machine. + */ + char *(*get_page)(struct restore_data *comm_data, void *data, + unsigned long pfn); + /* callback to flush memory that is transfered from the source machine + * to the guest. Update the guest''s pagetable if necessary. + */ + int (*flush_memory)(struct restore_data *comm_data, void *data); + /* callback to update the guest''s p2m table */ + int (*update_p2m)(struct restore_data *comm_data, void *data); + /* callback to finish restore process. It is called before xc_domain_restore() + * returns. 
+ * + * Return value: + * -1: error + * 0: continue to start vm + * 1: continue to do a checkpoint + */ + int (*finish_restore)(struct restore_data *comm_data, void *data); /* callback to restore toolstack specific data */ int (*toolstack_restore)(uint32_t domid, const uint8_t *buf, uint32_t size, void* data); + /* initialized by xc_domain_restore() */ + struct restore_data comm_data; + /* to be provided as the last argument to each callback function */ void* data; }; -- 1.7.4
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 06/16] colo: implement restore_callbacks init()/free()
This patch implements restore callbacks for colo: 1. init(): allocate the memory needed by the colo restore code 2. free(): free the memory allocated in init() Signed-off-by: Ye Wei <wei.ye1987@gmail.com> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> --- tools/libxc/Makefile | 2 +- tools/libxc/xc_domain_restore_colo.c | 145 ++++++++++++++++++++++++++++++++++ tools/libxc/xc_save_restore_colo.h | 10 +++ 3 files changed, 156 insertions(+), 1 deletions(-) create mode 100644 tools/libxc/xc_domain_restore_colo.c create mode 100644 tools/libxc/xc_save_restore_colo.h diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile index 512a994..70994b9 100644 --- a/tools/libxc/Makefile +++ b/tools/libxc/Makefile @@ -42,7 +42,7 @@ CTRL_SRCS-$(CONFIG_MiniOS) += xc_minios.c GUEST_SRCS-y := GUEST_SRCS-y += xg_private.c xc_suspend.c ifeq ($(CONFIG_MIGRATE),y) -GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c +GUEST_SRCS-y += xc_domain_restore.c xc_domain_save.c xc_domain_restore_colo.c GUEST_SRCS-y += xc_offline_page.c xc_compression.c else GUEST_SRCS-y += xc_nomigrate.c diff --git a/tools/libxc/xc_domain_restore_colo.c b/tools/libxc/xc_domain_restore_colo.c new file mode 100644 index 0000000..674e55e --- /dev/null +++ b/tools/libxc/xc_domain_restore_colo.c @@ -0,0 +1,145 @@ +#include <xc_save_restore_colo.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <xc_bitops.h> + +struct restore_colo_data +{ + unsigned long max_mem_pfn; + + /* cache the whole memory + * + * The SVM is running in colo mode, so we should cache the whole memory + * of the SVM. + */ + char* pagebase; + + /* which page is dirty?
*/ + unsigned long *dirty_pages; + + /* suspend evtchn */ + int local_port; + + xc_evtchn *xce; + + int first_time; + + /* PV */ + /* store the pfn type on slave side */ + unsigned long *pfn_type_slaver; + xen_pfn_t p2m_fll; + + /* cache p2m frame list list */ + char *p2m_frame_list_list; + + /* cache p2m frame list */ + char *p2m_frame_list; + + /* temp buffers (avoid frequent malloc/free) */ + unsigned long *pfn_batch_slaver; + unsigned long *pfn_type_batch_slaver; + unsigned long *p2m_frame_list_temp; +}; + +/* we restore only one vm in a process, so it is safe to use a global variable */ +DECLARE_HYPERCALL_BUFFER(unsigned long, dirty_pages); + +int colo_init(struct restore_data *comm_data, void **data) +{ + xc_dominfo_t info; + int i; + unsigned long size; + xc_interface *xch = comm_data->xch; + struct restore_colo_data *colo_data; + struct domain_info_context *dinfo = comm_data->dinfo; + + if (dirty_pages) + /* colo_init() is called more than once?? */ + return -1; + + colo_data = calloc(1, sizeof(struct restore_colo_data)); + if (!colo_data) + return -1; + + if (comm_data->hvm) + { + /* hvm is unsupported now */ + free(colo_data); + return -1; + } + + if (xc_domain_getinfo(xch, comm_data->dom, 1, &info) != 1) + { + PERROR("Could not get domain info"); + goto err; + } + + colo_data->max_mem_pfn = info.max_memkb >> (PAGE_SHIFT - 10); + + colo_data->pfn_type_slaver = calloc(dinfo->p2m_size, sizeof(xen_pfn_t)); + colo_data->pfn_batch_slaver = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t)); + colo_data->pfn_type_batch_slaver = calloc(MAX_BATCH_SIZE, sizeof(xen_pfn_t)); + colo_data->p2m_frame_list_temp = malloc(P2M_FL_ENTRIES * sizeof(unsigned long)); + colo_data->p2m_frame_list_list = malloc(PAGE_SIZE); + colo_data->p2m_frame_list = malloc(P2M_FLL_ENTRIES * PAGE_SIZE); + if (!colo_data->pfn_type_slaver || !colo_data->pfn_batch_slaver || + !colo_data->pfn_type_batch_slaver || !colo_data->p2m_frame_list_temp || + !colo_data->p2m_frame_list_list ||
!colo_data->p2m_frame_list) { + PERROR("Could not allocate memory for restore colo data"); + goto err; + } + + dirty_pages = xc_hypercall_buffer_alloc_pages(xch, dirty_pages, + NRPAGES(bitmap_size(dinfo->p2m_size))); + colo_data->dirty_pages = dirty_pages; + + size = dinfo->p2m_size * PAGE_SIZE; + colo_data->pagebase = malloc(size); + if (!colo_data->dirty_pages || !colo_data->pagebase) { + PERROR("Could not allocate memory for restore colo data"); + goto err; + } + + colo_data->xce = xc_evtchn_open(NULL, 0); + if (!colo_data->xce) { + PERROR("Could not open evtchn"); + goto err; + } + + for (i = 0; i < dinfo->p2m_size; i++) + comm_data->pfn_type[i] = XEN_DOMCTL_PFINFO_XTAB; + memset(dirty_pages, 0xff, bitmap_size(dinfo->p2m_size)); + colo_data->first_time = 1; + colo_data->local_port = -1; + *data = colo_data; + + return 0; + +err: + colo_free(comm_data, colo_data); + *data = NULL; + return -1; +} + +void colo_free(struct restore_data *comm_data, void *data) +{ + struct restore_colo_data *colo_data = data; + struct domain_info_context *dinfo = comm_data->dinfo; + + if (!colo_data) + return; + + free(colo_data->pfn_type_slaver); + free(colo_data->pagebase); + free(colo_data->pfn_batch_slaver); + free(colo_data->pfn_type_batch_slaver); + free(colo_data->p2m_frame_list_temp); + free(colo_data->p2m_frame_list); + free(colo_data->p2m_frame_list_list); + if (dirty_pages) + xc_hypercall_buffer_free_pages(comm_data->xch, dirty_pages, + NRPAGES(bitmap_size(dinfo->p2m_size))); + if (colo_data->xce) + xc_evtchn_close(colo_data->xce); + free(colo_data); +} diff --git a/tools/libxc/xc_save_restore_colo.h b/tools/libxc/xc_save_restore_colo.h new file mode 100644 index 0000000..b5416af --- /dev/null +++ b/tools/libxc/xc_save_restore_colo.h @@ -0,0 +1,10 @@ +#ifndef XC_SAVE_RESTORE_COLO_H +#define XC_SAVE_RESTORE_COLO_H + +#include <xg_save_restore.h> +#include <xg_private.h> + +extern int colo_init(struct restore_data *, void **); +extern void colo_free(struct restore_data *, 
void *); + +#endif -- 1.7.4
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 07/16] colo: implement restore_callbacks get_page()
This patch implements restore callbacks for colo: 1. get_page(): We have cached the whole memory, so just return the cached buffer. The page is also marked as dirty. Signed-off-by: Ye Wei <wei.ye1987@gmail.com> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> --- tools/libxc/xc_domain_restore_colo.c | 9 +++++++++ tools/libxc/xc_save_restore_colo.h | 1 + 2 files changed, 10 insertions(+), 0 deletions(-) diff --git a/tools/libxc/xc_domain_restore_colo.c b/tools/libxc/xc_domain_restore_colo.c index 674e55e..77b63b6 100644 --- a/tools/libxc/xc_domain_restore_colo.c +++ b/tools/libxc/xc_domain_restore_colo.c @@ -143,3 +143,12 @@ void colo_free(struct restore_data *comm_data, void *data) xc_evtchn_close(colo_data->xce); free(colo_data); } + +char* colo_get_page(struct restore_data *comm_data, void *data, + unsigned long pfn) +{ + struct restore_colo_data *colo_data = data; + + set_bit(pfn, colo_data->dirty_pages); + return colo_data->pagebase + pfn * PAGE_SIZE; +} diff --git a/tools/libxc/xc_save_restore_colo.h b/tools/libxc/xc_save_restore_colo.h index b5416af..67c567c 100644 --- a/tools/libxc/xc_save_restore_colo.h +++ b/tools/libxc/xc_save_restore_colo.h @@ -6,5 +6,6 @@ extern int colo_init(struct restore_data *, void **); extern void colo_free(struct restore_data *, void *); +extern char *colo_get_page(struct restore_data *, void *, unsigned long); #endif -- 1.7.4
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 08/16] colo: implement restore_callbacks flush_memory
This patch implements restore callbacks for colo: 1. flush_memory(): We update the memory as follows: a. pin non-dirty L1 pagetables b. unpin pagetables except non-dirty L1 c. update the memory d. pin page tables e. unpin non-dirty L1 pagetables Signed-off-by: Ye Wei <wei.ye1987@gmail.com> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> --- tools/libxc/xc_domain_restore_colo.c | 372 ++++++++++++++++++++++++++++++++++ tools/libxc/xc_save_restore_colo.h | 1 + 2 files changed, 373 insertions(+), 0 deletions(-) diff --git a/tools/libxc/xc_domain_restore_colo.c b/tools/libxc/xc_domain_restore_colo.c index 77b63b6..50009fa 100644 --- a/tools/libxc/xc_domain_restore_colo.c +++ b/tools/libxc/xc_domain_restore_colo.c @@ -152,3 +152,375 @@ char* colo_get_page(struct restore_data *comm_data, void *data, set_bit(pfn, colo_data->dirty_pages); return colo_data->pagebase + pfn * PAGE_SIZE; } + +/* Step1: + * + * pin non-dirty L1 pagetables: ~dirty_pages & mL1 (= ~dirty_pages & sL1) + * mL1: L1 pages on master side + * sL1: L1 pages on slaver side + */ +static int pin_l1(struct restore_data *comm_data, + struct restore_colo_data *colo_data) +{ + unsigned int nr_pins = 0; + unsigned long i; + struct mmuext_op pin[MAX_PIN_BATCH]; + struct domain_info_context *dinfo = comm_data->dinfo; + unsigned long *pfn_type = comm_data->pfn_type; + uint32_t dom = comm_data->dom; + xc_interface *xch = comm_data->xch; + unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver; + unsigned long *dirty_pages = colo_data->dirty_pages; + + for (i = 0; i < dinfo->p2m_size; i++) + { + switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) + { + case XEN_DOMCTL_PFINFO_L1TAB: + if (pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LPINTAB) + /* don't pin already-pinned tables */ + continue; + + if (test_bit(i, dirty_pages)) + /* don't pin dirty */ + continue; + + /* here, it must also be L1 in slaver,
otherwise it is dirty. + * (add test code ?) + */ + pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE; + break; + + case XEN_DOMCTL_PFINFO_L2TAB: + case XEN_DOMCTL_PFINFO_L3TAB: + case XEN_DOMCTL_PFINFO_L4TAB: + default: + continue; + } + + pin[nr_pins].arg1.mfn = comm_data->p2m[i]; + nr_pins++; + + /* Batch full? Then flush. */ + if (nr_pins == MAX_PIN_BATCH) + { + if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0) + { + PERROR("Failed to pin L1 batch of %d page tables", nr_pins); + return 1; + } + nr_pins = 0; + } + } + + /* Flush final partial batch. */ + if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)) + { + PERROR("Failed to pin L1 batch of %d page tables", nr_pins); + return 1; + } + + return 0; +} + +/* Step2: + * + * unpin pagetables except non-dirty L1: sL2 + sL3 + sL4 + (dirty_pages & sL1) + * sL1: L1 pages on slaver side + * sL2: L2 pages on slaver side + * sL3: L3 pages on slaver side + * sL4: L4 pages on slaver side + */ +static int unpin_pagetable(struct restore_data *comm_data, + struct restore_colo_data *colo_data) +{ + unsigned int nr_pins = 0; + unsigned long i; + struct mmuext_op pin[MAX_PIN_BATCH]; + struct domain_info_context *dinfo = comm_data->dinfo; + uint32_t dom = comm_data->dom; + xc_interface *xch = comm_data->xch; + unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver; + unsigned long *dirty_pages = colo_data->dirty_pages; + + for (i = 0; i < dinfo->p2m_size; i++) + { + if ( (pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 ) + continue; + + switch ( pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) + { + case XEN_DOMCTL_PFINFO_L1TAB: + if (!test_bit(i, dirty_pages)) + /* it is in (~dirty_pages & mL1), keep it */ + continue; + /* fallthrough */ + case XEN_DOMCTL_PFINFO_L2TAB: + case XEN_DOMCTL_PFINFO_L3TAB: + case XEN_DOMCTL_PFINFO_L4TAB: + pin[nr_pins].cmd = MMUEXT_UNPIN_TABLE; + break; + + default: + continue; + } + + pin[nr_pins].arg1.mfn = comm_data->p2m[i]; + nr_pins++; + + /* Batch full? Then flush.
*/ + if (nr_pins == MAX_PIN_BATCH) + { + if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0) + { + PERROR("Failed to unpin batch of %d page tables", nr_pins); + return 1; + } + nr_pins = 0; + } + } + + /* Flush final partial batch. */ + if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)) + { + PERROR("Failed to unpin batch of %d page tables", nr_pins); + return 1; + } + + return 0; +} + +/* we have unpinned all pagetables except non-dirty L1. So it is OK to map the + * dirty memory and update it. + */ +static int update_memory(struct restore_data *comm_data, + struct restore_colo_data *colo_data) +{ + unsigned long pfn; + unsigned long max_mem_pfn = colo_data->max_mem_pfn; + unsigned long *pfn_type = comm_data->pfn_type; + unsigned long pagetype; + uint32_t dom = comm_data->dom; + xc_interface *xch = comm_data->xch; + struct xc_mmu *mmu = comm_data->mmu; + unsigned long *dirty_pages = colo_data->dirty_pages; + char *pagebase = colo_data->pagebase; + int pfn_err = 0; + char *region_base_slaver; + xen_pfn_t region_mfn_slaver; + unsigned long mfn; + char *pagebuff; + + for (pfn = 0; pfn < max_mem_pfn; pfn++) { + if (!test_bit(pfn, dirty_pages)) + continue; + + pagetype = pfn_type[pfn] & XEN_DOMCTL_PFINFO_LTAB_MASK; + if (pagetype == XEN_DOMCTL_PFINFO_XTAB) + /* a bogus/unmapped page: skip it */ + continue; + + mfn = comm_data->p2m[pfn]; + region_mfn_slaver = mfn; + region_base_slaver = xc_map_foreign_bulk(xch, dom, + PROT_WRITE, + &region_mfn_slaver, + &pfn_err, 1); + if (!region_base_slaver || pfn_err) { + PERROR("update_memory: xc_map_foreign_bulk failed"); + return 1; + } + + pagebuff = (char *)(pagebase + pfn * PAGE_SIZE); + memcpy(region_base_slaver, pagebuff, PAGE_SIZE); + munmap(region_base_slaver, PAGE_SIZE); + + if (xc_add_mmu_update(xch, mmu, (((uint64_t)mfn) << PAGE_SHIFT) + | MMU_MACHPHYS_UPDATE, pfn) ) + { + PERROR("failed machphys update mfn=%lx pfn=%lx", mfn, pfn); + return 1; + } + } + + /* + * Ensure we flush all machphys updates before potential
PAE-specific + * reallocations below. + */ + if (xc_flush_mmu_updates(xch, mmu)) + { + PERROR("Error doing flush_mmu_updates()"); + return 1; + } + + return 0; +} + +/* Step 4: pin master pt + * Pin page tables. Do this after writing to them as otherwise Xen + * will barf when doing the type-checking. + */ +static int pin_pagetable(struct restore_data *comm_data, + struct restore_colo_data *colo_data) +{ + unsigned int nr_pins = 0; + unsigned long i; + struct mmuext_op pin[MAX_PIN_BATCH]; + struct domain_info_context *dinfo = comm_data->dinfo; + unsigned long *pfn_type = comm_data->pfn_type; + uint32_t dom = comm_data->dom; + xc_interface *xch = comm_data->xch; + unsigned long *dirty_pages = colo_data->dirty_pages; + + for ( i = 0; i < dinfo->p2m_size; i++ ) + { + if ( (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) == 0 ) + continue; + + switch ( pfn_type[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) + { + case XEN_DOMCTL_PFINFO_L1TAB: + if (!test_bit(i, dirty_pages)) + /* it is in (~dirty_pages & mL1)(=~dirty_pages & sL1), + * already pinned + */ + continue; + + pin[nr_pins].cmd = MMUEXT_PIN_L1_TABLE; + break; + + case XEN_DOMCTL_PFINFO_L2TAB: + pin[nr_pins].cmd = MMUEXT_PIN_L2_TABLE; + break; + + case XEN_DOMCTL_PFINFO_L3TAB: + pin[nr_pins].cmd = MMUEXT_PIN_L3_TABLE; + break; + + case XEN_DOMCTL_PFINFO_L4TAB: + pin[nr_pins].cmd = MMUEXT_PIN_L4_TABLE; + break; + + default: + continue; + } + + pin[nr_pins].arg1.mfn = comm_data->p2m[i]; + nr_pins++; + + /* Batch full? Then flush. */ + if (nr_pins == MAX_PIN_BATCH) + { + if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0) + { + PERROR("Failed to pin batch of %d page tables", nr_pins); + return 1; + } + nr_pins = 0; + } + } + + /* Flush final partial batch.
*/ + if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)) + { + PERROR("Failed to pin batch of %d page tables", nr_pins); + return 1; + } + + return 0; +} + +/* Step5: + * unpin unneeded non-dirty L1 pagetables: ~dirty_pages & mL1 (= ~dirty_pages & sL1) + */ +static int unpin_l1(struct restore_data *comm_data, + struct restore_colo_data *colo_data) +{ + unsigned int nr_pins = 0; + unsigned long i; + struct mmuext_op pin[MAX_PIN_BATCH]; + struct domain_info_context *dinfo = comm_data->dinfo; + unsigned long *pfn_type = comm_data->pfn_type; + uint32_t dom = comm_data->dom; + xc_interface *xch = comm_data->xch; + unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver; + unsigned long *dirty_pages = colo_data->dirty_pages; + + for (i = 0; i < dinfo->p2m_size; i++) + { + switch ( pfn_type_slaver[i] & XEN_DOMCTL_PFINFO_LTABTYPE_MASK ) + { + case XEN_DOMCTL_PFINFO_L1TAB: + if (pfn_type[i] & XEN_DOMCTL_PFINFO_LPINTAB) // still needed + continue; + if (test_bit(i, dirty_pages)) // not pinned by step 1 + continue; + + pin[nr_pins].cmd = MMUEXT_UNPIN_TABLE; + break; + + case XEN_DOMCTL_PFINFO_L2TAB: + case XEN_DOMCTL_PFINFO_L3TAB: + case XEN_DOMCTL_PFINFO_L4TAB: + default: + continue; + } + + pin[nr_pins].arg1.mfn = comm_data->p2m[i]; + nr_pins++; + + /* Batch full? Then flush. */ + if (nr_pins == MAX_PIN_BATCH) + { + if (xc_mmuext_op(xch, pin, nr_pins, dom) < 0) + { + PERROR("Failed to unpin L1 batch of %d page tables", nr_pins); + return 1; + } + nr_pins = 0; + } + } + + /* Flush final partial batch.
*/ + if ((nr_pins != 0) && (xc_mmuext_op(xch, pin, nr_pins, dom) < 0)) + { + PERROR("Failed to unpin L1 batch of %d page tables", nr_pins); + return 1; + } + + return 0; +} + +int colo_flush_memory(struct restore_data *comm_data, void *data) +{ + struct restore_colo_data *colo_data = data; + xc_interface *xch = comm_data->xch; + uint32_t dom = comm_data->dom; + DECLARE_HYPERCALL; + + if (!colo_data->first_time) + { + /* reset cpu */ + hypercall.op = __HYPERVISOR_reset_vcpu_op; + hypercall.arg[0] = (unsigned long)dom; + do_xen_hypercall(xch, &hypercall); + } + + if (pin_l1(comm_data, colo_data) != 0) + return -1; + if (unpin_pagetable(comm_data, colo_data) != 0) + return -1; + + if (update_memory(comm_data, colo_data) != 0) + return -1; + + if (pin_pagetable(comm_data, colo_data) != 0) + return -1; + if (unpin_l1(comm_data, colo_data) != 0) + return -1; + + memcpy(colo_data->pfn_type_slaver, comm_data->pfn_type, + comm_data->dinfo->p2m_size * sizeof(xen_pfn_t)); + + return 0; +} diff --git a/tools/libxc/xc_save_restore_colo.h b/tools/libxc/xc_save_restore_colo.h index 67c567c..8af75b4 100644 --- a/tools/libxc/xc_save_restore_colo.h +++ b/tools/libxc/xc_save_restore_colo.h @@ -7,5 +7,6 @@ extern int colo_init(struct restore_data *, void **); extern void colo_free(struct restore_data *, void *); extern char *colo_get_page(struct restore_data *, void *, unsigned long); +extern int colo_flush_memory(struct restore_data *, void *); #endif -- 1.7.4
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 09/16] colo: implement restore_callbacks update_p2m()
This patch implements restore callbacks for colo: 1. update_p2m(): Just update the dirty pages which store the p2m. Signed-off-by: Ye Wei <wei.ye1987@gmail.com> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> --- tools/libxc/xc_domain_restore_colo.c | 78 ++++++++++++++++++++++++++++++++++ tools/libxc/xc_save_restore_colo.h | 1 + 2 files changed, 79 insertions(+), 0 deletions(-) diff --git a/tools/libxc/xc_domain_restore_colo.c b/tools/libxc/xc_domain_restore_colo.c index 50009fa..70cdd16 100644 --- a/tools/libxc/xc_domain_restore_colo.c +++ b/tools/libxc/xc_domain_restore_colo.c @@ -524,3 +524,81 @@ int colo_flush_memory(struct restore_data *comm_data, void *data) return 0; } + +int colo_update_p2m_table(struct restore_data *comm_data, void *data) +{ + struct restore_colo_data *colo_data = data; + unsigned long i, j, n, pfn; + unsigned long *p2m_frame_list = comm_data->p2m_frame_list; + struct domain_info_context *dinfo = comm_data->dinfo; + unsigned long *pfn_type = comm_data->pfn_type; + xc_interface *xch = comm_data->xch; + uint32_t dom = comm_data->dom; + unsigned long *dirty_pages = colo_data->dirty_pages; + unsigned long *p2m_frame_list_temp = colo_data->p2m_frame_list_temp; + + /* A temporary mapping of the guest's p2m table (all dirty pages) */ + xen_pfn_t *live_p2m; + /* A temporary mapping of the guest's p2m table (1 page) */ + xen_pfn_t *live_p2m_one; + unsigned long *p2m; + + j = 0; + for (i = 0; i < P2M_FL_ENTRIES; i++) + { + pfn = p2m_frame_list[i]; + if ((pfn >= dinfo->p2m_size) || (pfn_type[pfn] != XEN_DOMCTL_PFINFO_NOTAB)) + { + ERROR("PFN-to-MFN frame number %li (%#lx) is bad", i, pfn); + return -1; + } + + if (!test_bit(pfn, dirty_pages)) + continue; + + p2m_frame_list_temp[j++] = comm_data->p2m[pfn]; + } + + if (j) + { + /* Copy the P2M we've constructed to the 'live' P2M */ + if (!(live_p2m = xc_map_foreign_pages(xch, dom, PROT_WRITE, + p2m_frame_list_temp, j))) + { + PERROR("Couldn't map
p2m table"); + return -1; + } + + j = 0; + for (i = 0; i < P2M_FL_ENTRIES; i++) + { + pfn = p2m_frame_list[i]; + if (!test_bit(pfn, dirty_pages)) + continue; + + live_p2m_one = (xen_pfn_t *)((char *)live_p2m + PAGE_SIZE * j++); + /* If the domain we're restoring has a different word size to ours, + * we need to adjust the live_p2m assignment appropriately. + * Use a separate inner loop counter: reusing i would clobber the + * outer loop, and "i >= 0" never terminates for an unsigned type. */ + if (dinfo->guest_width > sizeof (xen_pfn_t)) + { + int k; + n = (i + 1) * FPP - 1; + for (k = FPP - 1; k >= 0; k--) + ((uint64_t *)live_p2m_one)[k] = (long)comm_data->p2m[n--]; + } + else if (dinfo->guest_width < sizeof (xen_pfn_t)) + { + int k; + n = i * FPP; + for (k = 0; k < FPP; k++) + ((uint32_t *)live_p2m_one)[k] = comm_data->p2m[n++]; + } + else + { + p2m = (xen_pfn_t *)((char *)comm_data->p2m + PAGE_SIZE * i); + memcpy(live_p2m_one, p2m, PAGE_SIZE); + } + } + munmap(live_p2m, j * PAGE_SIZE); + } + + return 0; +} diff --git a/tools/libxc/xc_save_restore_colo.h b/tools/libxc/xc_save_restore_colo.h index 8af75b4..98e5128 100644 --- a/tools/libxc/xc_save_restore_colo.h +++ b/tools/libxc/xc_save_restore_colo.h @@ -8,5 +8,6 @@ extern int colo_init(struct restore_data *, void **); extern void colo_free(struct restore_data *, void *); extern char *colo_get_page(struct restore_data *, void *, unsigned long); extern int colo_flush_memory(struct restore_data *, void *); +extern int colo_update_p2m_table(struct restore_data *, void *); #endif -- 1.7.4
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 10/16] colo: implement restore_callbacks finish_restore()
This patch implements restore callbacks for colo: 1. finish_restore(): We run xc_restore in XendCheckpoint.py. We communicate with XendCheckpoint.py like this: a. write "finish\n" to stdout when we are ready to resume the vm. b. XendCheckpoint.py writes "resume" when the vm is resumed c. write "resume" to master when postresume is done d. "continue" is read from master when a new checkpoint begins e. write "suspend" to master when the vm is suspended f. "start" is read from master when primary begins to transfer dirty pages. The SVM is running in colo mode, so we should suspend it to sync the state, then resume it. We need to fix p2m_frame_list_list before resuming the SVM. The content of p2m_frame_list_list should be cached after suspending the SVM. Signed-off-by: Ye Wei <wei.ye1987@gmail.com> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> --- tools/libxc/Makefile | 6 +- tools/libxc/xc_domain_restore_colo.c | 335 ++++++++++++++++++++++++++++++++++ tools/libxc/xc_save_restore_colo.h | 1 + tools/libxl/Makefile | 2 +- tools/xcutils/Makefile | 4 +- 5 files changed, 342 insertions(+), 6 deletions(-) diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile index 70994b9..92d11af 100644 --- a/tools/libxc/Makefile +++ b/tools/libxc/Makefile @@ -49,7 +49,7 @@ GUEST_SRCS-y += xc_nomigrate.c endif vpath %.c ../../xen/common/libelf -CFLAGS += -I../../xen/common/libelf +CFLAGS += -I../../xen/common/libelf -I../xenstore ELF_SRCS-y += libelf-tools.c libelf-loader.c ELF_SRCS-y += libelf-dominfo.c @@ -199,8 +199,8 @@ xc_dom_bzimageloader.o: CFLAGS += $(call zlib-options,D) xc_dom_bzimageloader.opic: CFLAGS += $(call zlib-options,D) libxenguest.so.$(MAJOR).$(MINOR): COMPRESSION_LIBS = $(call zlib-options,l) -libxenguest.so.$(MAJOR).$(MINOR): $(GUEST_PIC_OBJS) libxenctrl.so - $(CC) $(LDFLAGS) -Wl,$(SONAME_LDFLAG) -Wl,libxenguest.so.$(MAJOR) $(SHLIB_LDFLAGS) -o $@ $(GUEST_PIC_OBJS) $(COMPRESSION_LIBS) -lz $(LDLIBS_libxenctrl)
$(PTHREAD_LIBS) $(APPEND_LDFLAGS) +libxenguest.so.$(MAJOR).$(MINOR): $(GUEST_PIC_OBJS) libxenctrl.so $(LDLIBS_libxenstore) + $(CC) $(LDFLAGS) -Wl,$(SONAME_LDFLAG) -Wl,libxenguest.so.$(MAJOR) $(SHLIB_LDFLAGS) -o $@ $(GUEST_PIC_OBJS) $(COMPRESSION_LIBS) -lz $(LDLIBS_libxenctrl) $(PTHREAD_LIBS) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS) xenctrl_osdep_ENOSYS.so: $(OSDEP_PIC_OBJS) libxenctrl.so $(CC) -g $(LDFLAGS) $(SHLIB_LDFLAGS) -o $@ $(OSDEP_PIC_OBJS) $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS) diff --git a/tools/libxc/xc_domain_restore_colo.c b/tools/libxc/xc_domain_restore_colo.c index 70cdd16..6b87a2d 100644 --- a/tools/libxc/xc_domain_restore_colo.c +++ b/tools/libxc/xc_domain_restore_colo.c @@ -2,6 +2,7 @@ #include <sys/types.h> #include <sys/wait.h> #include <xc_bitops.h> +#include <xenstore.h> struct restore_colo_data { @@ -602,3 +603,337 @@ int colo_update_p2m_table(struct restore_data *comm_data, void *data) return 0; } + +static int update_pfn_type(xc_interface *xch, uint32_t dom, int count, xen_pfn_t *pfn_batch, + xen_pfn_t *pfn_type_batch, xen_pfn_t *pfn_type) +{ + unsigned long k; + + if (xc_get_pfn_type_batch(xch, dom, count, pfn_type_batch)) + { + ERROR("xc_get_pfn_type_batch for slaver failed"); + return -1; + } + + for (k = 0; k < count; k++) + pfn_type[pfn_batch[k]] = pfn_type_batch[k] & XEN_DOMCTL_PFINFO_LTAB_MASK; + + return 0; +} + +static int install_fw_network(struct restore_data *comm_data) +{ + pid_t pid; + xc_interface *xch = comm_data->xch; + int status; + int rc; + + char vif[20]; + + snprintf(vif, sizeof(vif), "vif%u.0", comm_data->dom); + + pid = vfork(); + if (pid < 0) { + ERROR("vfork fails"); + return -1; + } + + if (pid > 0) { + rc = waitpid(pid, &status, 0); + if (rc != pid || !WIFEXITED(status) || WEXITSTATUS(status) != 0) { + ERROR("getting child status fails"); + return -1; + } + + return 0; + } + + execl("/etc/xen/scripts/network-colo", "network-colo", "slaver", "install", vif, "eth0", NULL); + ERROR("execl fails"); + return -1; +} + 
+static int get_p2m_list(struct restore_data *comm_data, + struct restore_colo_data *colo_data, + xen_pfn_t *p2m_fll, + xen_pfn_t **p2m_frame_list_list_p, + char **p2m_frame_list_p, + int prot) +{ + struct domain_info_context *dinfo = comm_data->dinfo; + xc_interface *xch = comm_data->xch; + uint32_t dom = comm_data->dom; + shared_info_t *shinfo = NULL; + xc_dominfo_t info; + xen_pfn_t *p2m_frame_list_list = NULL; + char *p2m_frame_list = NULL; + int rc = -1; + + if ( xc_domain_getinfo(xch, dom, 1, &info) != 1 ) + { + ERROR("Could not get domain info"); + return -1; + } + + /* Map the shared info frame */ + shinfo = xc_map_foreign_range(xch, dom, PAGE_SIZE, + prot, + info.shared_info_frame); + if ( shinfo == NULL ) + { + ERROR("Couldn't map shared info"); + return -1; + } + + if (p2m_fll == NULL) + shinfo->arch.pfn_to_mfn_frame_list_list = colo_data->p2m_fll; + else + *p2m_fll = shinfo->arch.pfn_to_mfn_frame_list_list; + + p2m_frame_list_list = xc_map_foreign_range(xch, dom, PAGE_SIZE, prot, + shinfo->arch.pfn_to_mfn_frame_list_list); + if ( p2m_frame_list_list == NULL ) + { + ERROR("Couldn't map p2m_frame_list_list"); + goto error; + } + + p2m_frame_list = xc_map_foreign_pages(xch, dom, prot, + p2m_frame_list_list, + P2M_FLL_ENTRIES); + if ( p2m_frame_list == NULL ) + { + ERROR("Couldn't map p2m_frame_list"); + goto error; + } + + *p2m_frame_list_list_p = p2m_frame_list_list; + *p2m_frame_list_p = p2m_frame_list; + rc = 0; + +error: + munmap(shinfo, PAGE_SIZE); + if (rc && p2m_frame_list_list) + munmap(p2m_frame_list_list, PAGE_SIZE); + + return rc; +} + +static int update_p2m_list(struct restore_data *comm_data, + struct restore_colo_data *colo_data) +{ + struct domain_info_context *dinfo = comm_data->dinfo; + xen_pfn_t *p2m_frame_list_list = NULL; + char *p2m_frame_list = NULL; + int rc; + + rc = get_p2m_list(comm_data, colo_data, NULL, &p2m_frame_list_list, + &p2m_frame_list, PROT_READ | PROT_WRITE); + if (rc) + return rc; + + memcpy(p2m_frame_list_list,
colo_data->p2m_frame_list_list, PAGE_SIZE); + memcpy(p2m_frame_list, colo_data->p2m_frame_list, PAGE_SIZE * P2M_FLL_ENTRIES); + + munmap(p2m_frame_list_list, PAGE_SIZE); + munmap(p2m_frame_list, PAGE_SIZE * P2M_FLL_ENTRIES); + + return 0; +} + +static int cache_p2m_list(struct restore_data *comm_data, + struct restore_colo_data *colo_data) +{ + struct domain_info_context *dinfo = comm_data->dinfo; + xen_pfn_t *p2m_frame_list_list = NULL; + char *p2m_frame_list = NULL; + int rc; + + rc = get_p2m_list(comm_data, colo_data, &colo_data->p2m_fll, + &p2m_frame_list_list, &p2m_frame_list, PROT_READ); + if (rc) + return rc; + + memcpy(colo_data->p2m_frame_list_list, p2m_frame_list_list, PAGE_SIZE); + memcpy(colo_data->p2m_frame_list, p2m_frame_list, PAGE_SIZE * P2M_FLL_ENTRIES); + + munmap(p2m_frame_list_list, PAGE_SIZE); + munmap(p2m_frame_list, PAGE_SIZE * P2M_FLL_ENTRIES); + + return 0; +} + +/* we are ready to start the guest when this function is called. We + * will not return until we need to do a new checkpoint or some error occurs.
+ * + * communication with python and master + * python code restore code master comment + * <=== "continue" a new checkpoint begins + * "suspend" ===> SVM is suspended + * "start" getting dirty pages begins + * <=== "finish\n" SVM is ready + * "resume" ===> SVM is resumed + * "resume" ===> postresume is done + * + * return value: + * -1: error + * 0: continue to start vm + * 1: continue to do a checkpoint + */ +int colo_finish_restore(struct restore_data *comm_data, void *data) +{ + struct restore_colo_data *colo_data = data; + xc_interface *xch = comm_data->xch; + uint32_t dom = comm_data->dom; + struct domain_info_context *dinfo = comm_data->dinfo; + xc_evtchn *xce = colo_data->xce; + unsigned long *pfn_batch_slaver = colo_data->pfn_batch_slaver; + unsigned long *pfn_type_batch_slaver = colo_data->pfn_type_batch_slaver; + unsigned long *pfn_type_slaver = colo_data->pfn_type_slaver; + + unsigned long i, j; + int rc; + char str[10]; + int remote_port; + int local_port = colo_data->local_port; + + /* fix pfn_to_mfn_frame_list_list */ + if (!colo_data->first_time) + { + if (update_p2m_list(comm_data, colo_data) < 0) + return -1; + } + + /* output the store-mfn & console-mfn */ + printf("store-mfn %li\n", comm_data->store_mfn); + printf("console-mfn %li\n", comm_data->console_mfn); + + /* notify python code checkpoint finish */ + printf("finish\n"); + fflush(stdout); + + /* we need to know which pages are dirty to restore the guest */ + if (xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY, + NULL, 0, NULL, 0, NULL) < 0 ) + { + ERROR("enabling logdirty fails"); + return -1; + } + + /* wait for domain resume, then connect the suspend evtchn */ + read_exact(0, str, 6); + str[6] = '\0'; + if (strcmp(str, "resume")) + { + ERROR("read %s, expect resume", str); + return -1; + } + + if (colo_data->first_time) { + if (install_fw_network(comm_data) < 0) + return -1; + } + + /* notify master vm is resumed */ + write_exact(comm_data->io_fd, "resume", 6); + + if
(colo_data->first_time) { + sleep(10); + remote_port = xs_suspend_evtchn_port(dom); + if (remote_port < 0) { + ERROR("getting remote suspend port fails"); + return -1; + } + + local_port = xc_suspend_evtchn_init(xch, xce, dom, remote_port); + if (local_port < 0) { + ERROR("initializing suspend evtchn fails"); + return -1; + } + + colo_data->local_port = local_port; + } + + /* wait for the next checkpoint */ + read_exact(comm_data->io_fd, str, 8); + str[8] = '\0'; + if (strcmp(str, "continue")) + { + ERROR("waiting for a new checkpoint fails"); + /* start the guest now? */ + return 0; + } + + /* notify the suspend evtchn */ + rc = xc_evtchn_notify(xce, local_port); + if (rc < 0) + { + ERROR("notifying the suspend evtchn fails"); + return -1; + } + + rc = xc_await_suspend(xch, xce, local_port); + if (rc < 0) + { + ERROR("waiting for suspend fails"); + return -1; + } + + /* notify master suspend is done */ + write_exact(comm_data->io_fd, "suspend", 7); + read_exact(comm_data->io_fd, str, 5); + str[5] = '\0'; + if (strcmp(str, "start")) + return -1; + + if (xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_CLEAN, + HYPERCALL_BUFFER(dirty_pages), dinfo->p2m_size, + NULL, 0, NULL) != dinfo->p2m_size) + { + ERROR("getting slaver dirty fails"); + return -1; + } + + if (xc_shadow_control(xch, dom, XEN_DOMCTL_SHADOW_OP_OFF, NULL, 0, NULL, + 0, NULL) < 0 ) + { + ERROR("disabling dirty-log fails"); + return -1; + } + + j = 0; + for (i = 0; i < colo_data->max_mem_pfn; i++) + { + if ( !test_bit(i, colo_data->dirty_pages) ) + continue; + + pfn_batch_slaver[j] = i; + pfn_type_batch_slaver[j++] = comm_data->p2m[i]; + if (j == MAX_BATCH_SIZE) + { + if (update_pfn_type(xch, dom, j, pfn_batch_slaver, + pfn_type_batch_slaver, pfn_type_slaver)) + { + return -1; + } + j = 0; + } + } + + if (j) + { + if (update_pfn_type(xch, dom, j, pfn_batch_slaver, + pfn_type_batch_slaver, pfn_type_slaver)) + { + return -1; + } + } + + if (cache_p2m_list(comm_data, colo_data) < 0) + return -1; + +
colo_data->first_time = 0; + + return 1; +} diff --git a/tools/libxc/xc_save_restore_colo.h b/tools/libxc/xc_save_restore_colo.h index 98e5128..57df750 100644 --- a/tools/libxc/xc_save_restore_colo.h +++ b/tools/libxc/xc_save_restore_colo.h @@ -9,5 +9,6 @@ extern void colo_free(struct restore_data *, void *); extern char *colo_get_page(struct restore_data *, void *, unsigned long); extern int colo_flush_memory(struct restore_data *, void *); extern int colo_update_p2m_table(struct restore_data *, void *); +extern int colo_finish_restore(struct restore_data *, void *); #endif diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile index cf214bb..36b924d 100644 --- a/tools/libxl/Makefile +++ b/tools/libxl/Makefile @@ -192,7 +192,7 @@ xl: $(XL_OBJS) libxlutil.so libxenlight.so $(CC) $(LDFLAGS) -o $@ $(XL_OBJS) libxlutil.so $(LDLIBS_libxenlight) $(LDLIBS_libxenctrl) -lyajl $(APPEND_LDFLAGS) libxl-save-helper: $(SAVE_HELPER_OBJS) libxenlight.so - $(CC) $(LDFLAGS) -o $@ $(SAVE_HELPER_OBJS) $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(APPEND_LDFLAGS) + $(CC) $(LDFLAGS) -o $@ $(SAVE_HELPER_OBJS) $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS) testidl: testidl.o libxlutil.so libxenlight.so $(CC) $(LDFLAGS) -o $@ testidl.o libxlutil.so $(LDLIBS_libxenlight) $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS) diff --git a/tools/xcutils/Makefile b/tools/xcutils/Makefile index 6c502f1..51f3f0e 100644 --- a/tools/xcutils/Makefile +++ b/tools/xcutils/Makefile @@ -27,13 +27,13 @@ all: build build: $(PROGRAMS) xc_restore: xc_restore.o - $(CC) $(LDFLAGS) $^ -o $@ $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(APPEND_LDFLAGS) + $(CC) $(LDFLAGS) $^ -o $@ $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS) xc_save: xc_save.o $(CC) $(LDFLAGS) $^ -o $@ $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS) readnotes: readnotes.o - $(CC) $(LDFLAGS) $^ -o $@ $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) 
$(APPEND_LDFLAGS) + $(CC) $(LDFLAGS) $^ -o $@ $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS) lsevtchn: lsevtchn.o $(CC) $(LDFLAGS) $^ -o $@ $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS) -- 1.7.4
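All of the control messages exchanged above are framed purely by their fixed byte lengths ("continue" is 8 bytes, "suspend" 7, "resume" 6, "start" 5), moved with read_exact()/write_exact() loops that retry until the full count is transferred. A minimal Python sketch of that convention, for readers following the protocol (the function names mirror the C helpers; this is not code from the Xen tree):

```python
import os

def write_exact(fd, data):
    """Write all of `data` to fd, looping over short writes."""
    view = memoryview(data)
    while view:
        n = os.write(fd, view)
        view = view[n:]

def read_exact(fd, count):
    """Read exactly `count` bytes from fd, looping over short reads."""
    chunks = []
    while count:
        chunk = os.read(fd, count)
        if not chunk:
            raise EOFError("peer closed before the full message arrived")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)
```

Because the lengths differ per message, the receiver must already know which message it expects — which is why the patches hard-code counts like `read_exact(comm_data->io_fd, str, 8)` for "continue".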
All restore callbacks have been implemented. Use these callbacks for colo
in xc_restore. Add a new argument to tell xc_restore whether it should use
colo mode.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/xcutils/xc_restore.c | 36 +++++++++++++++++++++++++++++-------
 1 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/tools/xcutils/xc_restore.c b/tools/xcutils/xc_restore.c
index 35d725c..659c159 100644
--- a/tools/xcutils/xc_restore.c
+++ b/tools/xcutils/xc_restore.c
@@ -14,6 +14,7 @@
 
 #include <xenctrl.h>
 #include <xenguest.h>
+#include <xc_save_restore_colo.h>
 
 int
 main(int argc, char **argv)
@@ -26,10 +27,12 @@ main(int argc, char **argv)
     unsigned long store_mfn, console_mfn;
     xentoollog_level lvl;
     xentoollog_logger *l;
+    struct restore_callbacks callback, *callback_p;
+    int colo = 0;
 
-    if ( (argc != 8) && (argc != 9) )
+    if ( (argc != 8) && (argc != 9) && (argc != 10) )
         errx(1, "usage: %s iofd domid store_evtchn "
-             "console_evtchn hvm pae apic [superpages]", argv[0]);
+             "console_evtchn hvm pae apic [superpages [colo]]", argv[0]);
 
     lvl = XTL_DETAIL;
     lflags = XTL_STDIOSTREAM_SHOW_PID | XTL_STDIOSTREAM_HIDE_PROGRESS;
@@ -46,20 +49,39 @@ main(int argc, char **argv)
     pae = atoi(argv[6]);
     apic = atoi(argv[7]);
     if ( argc == 9 )
-        superpages = atoi(argv[8]);
+        superpages = atoi(argv[8]);
     else
-        superpages = !!hvm;
+        superpages = !!hvm;
+
+    if ( argc == 10 )
+        colo = atoi(argv[9]);
+
+    if ( colo )
+    {
+        callback.init = colo_init;
+        callback.free = colo_free;
+        callback.get_page = colo_get_page;
+        callback.flush_memory = colo_flush_memory;
+        callback.update_p2m = colo_update_p2m_table;
+        callback.finish_restore = colo_finish_restore;
+        callback.data = NULL;
+        callback_p = &callback;
+    }
+    else
+    {
+        callback_p = NULL;
+    }
 
     ret = xc_domain_restore(xch, io_fd, domid, store_evtchn, &store_mfn, 0,
                             console_evtchn, &console_mfn, 0, hvm, pae, superpages,
-                            0, NULL, NULL);
+                            0, NULL, callback_p);
 
     if ( ret == 0 )
     {
-        printf("store-mfn %li\n", store_mfn);
+        printf("store-mfn %li\n", store_mfn);
         if ( !hvm )
             printf("console-mfn %li\n", console_mfn);
-        fflush(stdout);
+        fflush(stdout);
     }
 
     xc_interface_close(xch);
-- 
1.7.4
In colo mode, XendCheckpoint.py communicates with both the master and
xc_restore. This patch implements that communication. In colo mode, the
signature is "GuestColoRestore".

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/python/xen/xend/XendCheckpoint.py | 127 +++++++++++++++++++++---------
 1 files changed, 89 insertions(+), 38 deletions(-)

diff --git a/tools/python/xen/xend/XendCheckpoint.py b/tools/python/xen/xend/XendCheckpoint.py
index fa09757..ed71690 100644
--- a/tools/python/xen/xend/XendCheckpoint.py
+++ b/tools/python/xen/xend/XendCheckpoint.py
@@ -23,8 +23,11 @@ from xen.xend.XendLogging import log
 from xen.xend.XendConfig import XendConfig
 from xen.xend.XendConstants import *
 from xen.xend import XendNode
+from xen.xend.xenstore.xsutil import ResumeDomain
+from xen.remus import util
 
 SIGNATURE = "LinuxGuestRecord"
+COLO_SIGNATURE = "GuestColoRestore"
 QEMU_SIGNATURE = "QemuDeviceModelRecord"
 dm_batch = 512
 XC_SAVE = "xc_save"
@@ -203,10 +206,15 @@ def restore(xd, fd, dominfo = None, paused = False, relocating = False):
     signature = read_exact(fd, len(SIGNATURE),
                            "not a valid guest state file: signature read")
-    if signature != SIGNATURE:
+    if signature != SIGNATURE and signature != COLO_SIGNATURE:
         raise XendError("not a valid guest state file: found '%s'" %
                         signature)
 
+    if signature == COLO_SIGNATURE:
+        colo = True
+    else:
+        colo = False
+
     l = read_exact(fd, sizeof_int,
                    "not a valid guest state file: config size read")
     vmconfig_size = unpack("!i", l)[0]
@@ -301,12 +309,15 @@ def restore(xd, fd, dominfo = None, paused = False, relocating = False):
 
     cmd = map(str, [xen.util.auxbin.pathTo(XC_RESTORE),
                     fd, dominfo.getDomid(),
-                    store_port, console_port, int(is_hvm), pae, apic, superpages])
+                    store_port, console_port, int(is_hvm), pae, apic,
+                    superpages, int(colo)])
     log.debug("[xc_restore]: %s", string.join(cmd))
 
-    handler = RestoreInputHandler()
+
inputHandler = RestoreInputHandler() + restoreHandler = RestoreHandler(fd, colo, dominfo, inputHandler, + restore_image) - forkHelper(cmd, fd, handler.handler, True) + forkHelper(cmd, fd, inputHandler.handler, not colo, restoreHandler) # We don''t want to pass this fd to any other children -- we # might need to recover the disk space that backs it. @@ -321,42 +332,74 @@ def restore(xd, fd, dominfo = None, paused = False, relocating = False): raise XendError(''Could not read store MFN'') if not is_hvm and handler.console_mfn is None: - raise XendError(''Could not read console MFN'') + raise XendError(''Could not read console MFN'') + + restoreHandler.resume(True, paused, None) + + return dominfo + except Exception, exn: + dominfo.destroy() + log.exception(exn) + raise exn + + +class RestoreHandler: + def __init__(self, fd, colo, dominfo, inputHandler, restore_image): + self.fd = fd + self.colo = colo + self.firsttime = True + self.inputHandler = inputHandler + self.dominfo = dominfo + self.restore_image = restore_image + self.store_port = dominfo.store_port + self.console_port = dominfo.console_port + + def resume(self, finish, paused, child): + fd = self.fd + dominfo = self.dominfo + handler = self.inputHandler + restore_image = self.restore_image restore_image.setCpuid() + dominfo.completeRestore(handler.store_mfn, handler.console_mfn, + self.firsttime) - # xc_restore will wait for source to close connection - - dominfo.completeRestore(handler.store_mfn, handler.console_mfn) + if self.colo and not finish: + # notify master that checkpoint finishes + write_exact(fd, "finish", "failed to write finish done") + buf = read_exact(fd, 6, "failed to read resume flag") + if buf != "resume": + return False - # - # We shouldn''t hold the domains_lock over a waitForDevices - # As this function sometime gets called holding this lock, - # we must release it and re-acquire it appropriately - # from xen.xend import XendDomain - lock = True; - try: - 
XendDomain.instance().domains_lock.release() - except: - lock = False; - - try: - dominfo.waitForDevices() # Wait for backends to set up - finally: - if lock: - XendDomain.instance().domains_lock.acquire() + if self.firsttime: + lock = True + try: + XendDomain.instance().domains_lock.release() + except: + lock = False + + try: + dominfo.waitForDevices() # Wait for backends to set up + finally: + if lock: + XendDomain.instance().domains_lock.acquire() + if not paused: + dominfo.unpause() + else: + # colo + xc.domain_resume(dominfo.domid, 0) + ResumeDomain(dominfo.domid) - if not paused: - dominfo.unpause() + if self.colo and not finish: + child.tochild.write("resume") + child.tochild.flush() - return dominfo - except Exception, exn: - dominfo.destroy() - log.exception(exn) - raise exn + dominfo.store_port = self.store_port + dominfo.console_port = self.console_port + self.firsttime = False class RestoreInputHandler: def __init__(self): @@ -364,17 +407,25 @@ class RestoreInputHandler: self.console_mfn = None - def handler(self, line, _): + def handler(self, line, child, restoreHandler): + if line == "finish": + # colo + return restoreHandler.resume(False, False, child) + m = re.match(r"^(store-mfn) (\d+)$", line) if m: self.store_mfn = int(m.group(2)) - else: - m = re.match(r"^(console-mfn) (\d+)$", line) - if m: - self.console_mfn = int(m.group(2)) + return True + + m = re.match(r"^(console-mfn) (\d+)$", line) + if m: + self.console_mfn = int(m.group(2)) + return True + + return False -def forkHelper(cmd, fd, inputHandler, closeToChild): +def forkHelper(cmd, fd, inputHandler, closeToChild, restoreHandler): child = xPopen3(cmd, True, -1, [fd]) if closeToChild: @@ -392,7 +443,7 @@ def forkHelper(cmd, fd, inputHandler, closeToChild): else: line = line.rstrip() log.debug(''%s'', line) - inputHandler(line, child.tochild) + inputHandler(line, child, restoreHandler) except IOError, exn: raise XendError(''Error reading from child process for %s: %s'' % -- 1.7.4
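The reworked input handler above must now recognise one extra line from xc_restore, "finish", which marks the end of a colo checkpoint and triggers a resume cycle, in addition to the existing store-mfn/console-mfn lines. A reduced Python sketch of that dispatch logic (illustrative only; the real handler also hands control to RestoreHandler.resume() and talks back to the child process):

```python
import re

class RestoreLineHandler:
    """Parse xc_restore's stdout lines: record the store/console MFNs
    and flag the colo "finish" marker that starts a resume cycle."""

    def __init__(self):
        self.store_mfn = None
        self.console_mfn = None

    def handle(self, line):
        if line == "finish":
            return "resume"   # colo: caller should run its resume path
        m = re.match(r"^store-mfn (\d+)$", line)
        if m:
            self.store_mfn = int(m.group(1))
            return "ok"
        m = re.match(r"^console-mfn (\d+)$", line)
        if m:
            self.console_mfn = int(m.group(1))
            return "ok"
        return "ignored"
```

This makes the control-flow change of the patch visible: the handler is no longer a one-shot parser but is re-entered once per checkpoint for the lifetime of the secondary VM.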
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 13/16] xc_domain_save: flush cache before calling callbacks->postcopy()
callbacks->postcopy() may use the fd to transfer something to the other
end, so we should flush the cache before calling callbacks->postcopy().

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_domain_save.c | 6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index fbc15e9..b477188 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -2034,9 +2034,6 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
  out:
     completed = 1;
 
-    if ( !rc && callbacks->postcopy )
-        callbacks->postcopy(callbacks->data);
-
     /* guest has been resumed. Now we can compress data
      * at our own pace.
      */
@@ -2066,6 +2063,9 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
 
     discard_file_cache(xch, io_fd, 1 /* flush */);
 
+    if ( !rc && callbacks->postcopy )
+        callbacks->postcopy(callbacks->data);
+
     /* Enable compression now, finally */
     compressing = (flags & XCFLAGS_CHECKPOINT_COMPRESS);
 
-- 
1.7.4
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 14/16] add callback to configure network for colo
In colo mode, we compare the output packets from the PVM and the SVM, and
decide whether a new checkpoint is needed. We therefore need to configure
the network for colo: for example, copying and forwarding input packets to
the SVM, and forwarding output packets from the SVM to the master. All of
this work is done automatically by a script; this patch only adds a
callback to execute that script.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/python/xen/lowlevel/checkpoint/checkpoint.c | 20 ++++++++++++++++++--
 tools/python/xen/remus/save.py | 8 +++++---
 tools/remus/remus | 11 ++++++++++-
 3 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/tools/python/xen/lowlevel/checkpoint/checkpoint.c b/tools/python/xen/lowlevel/checkpoint/checkpoint.c
index c5cdd83..ec14b27 100644
--- a/tools/python/xen/lowlevel/checkpoint/checkpoint.c
+++ b/tools/python/xen/lowlevel/checkpoint/checkpoint.c
@@ -22,6 +22,7 @@ typedef struct {
   PyObject* suspend_cb;
   PyObject* postcopy_cb;
   PyObject* checkpoint_cb;
+  PyObject* setup_cb;
 
   PyThreadState* threadstate;
 } CheckpointObject;
@@ -91,6 +92,8 @@ static PyObject* pycheckpoint_close(PyObject* obj, PyObject* args)
   self->postcopy_cb = NULL;
   Py_XDECREF(self->checkpoint_cb);
   self->checkpoint_cb = NULL;
+  Py_XDECREF(self->setup_cb);
+  self->setup_cb = NULL;
 
   Py_INCREF(Py_None);
   return Py_None;
@@ -103,6 +106,7 @@ static PyObject* pycheckpoint_start(PyObject* obj, PyObject* args) {
   PyObject* suspend_cb = NULL;
   PyObject* postcopy_cb = NULL;
   PyObject* checkpoint_cb = NULL;
+  PyObject* setup_cb = NULL;
   unsigned int interval = 0;
   unsigned int flags = 0;
@@ -110,8 +114,8 @@ static PyObject* pycheckpoint_start(PyObject* obj, PyObject* args) {
   struct save_callbacks callbacks;
   int rc;
 
-  if (!PyArg_ParseTuple(args, "O|OOOII", &iofile, &suspend_cb, &postcopy_cb,
-                        &checkpoint_cb, &interval, &flags))
+  if (!PyArg_ParseTuple(args, "O|OOOOII", &iofile, &suspend_cb, &postcopy_cb,
+
&checkpoint_cb, &setup_cb, &interval, &flags)) return NULL; self->interval = interval; @@ -120,6 +124,7 @@ static PyObject* pycheckpoint_start(PyObject* obj, PyObject* args) { Py_XINCREF(suspend_cb); Py_XINCREF(postcopy_cb); Py_XINCREF(checkpoint_cb); + Py_XINCREF(setup_cb); fd = PyObject_AsFileDescriptor(iofile); Py_DECREF(iofile); @@ -155,6 +160,15 @@ static PyObject* pycheckpoint_start(PyObject* obj, PyObject* args) { } else self->checkpoint_cb = NULL; + if (setup_cb && setup_cb != Py_None) { + if (!PyCallable_Check(setup_cb)) { + PyErr_SetString(PyExc_TypeError, "setup callback not callable"); + return NULL; + } + self->setup_cb = setup_cb; + } else + self->setup_cb = NULL; + memset(&callbacks, 0, sizeof(callbacks)); callbacks.suspend = suspend_trampoline; callbacks.postcopy = postcopy_trampoline; @@ -180,6 +194,8 @@ static PyObject* pycheckpoint_start(PyObject* obj, PyObject* args) { Py_XDECREF(postcopy_cb); self->checkpoint_cb = NULL; Py_XDECREF(checkpoint_cb); + self->setup_cb = NULL; + Py_XDECREF(self->setup_cb); return NULL; } diff --git a/tools/python/xen/remus/save.py b/tools/python/xen/remus/save.py index 2193061..81e05b9 100644 --- a/tools/python/xen/remus/save.py +++ b/tools/python/xen/remus/save.py @@ -133,7 +133,7 @@ class Keepalive(object): class Saver(object): def __init__(self, domid, fd, suspendcb=None, resumecb=None, - checkpointcb=None, interval=0, flags=0): + checkpointcb=None, setupcb=None, interval=0, flags=0): """Create a Saver object for taking guest checkpoints. domid: name, number or UUID of a running domain fd: a stream to which checkpoint data will be written. @@ -142,6 +142,7 @@ class Saver(object): checkpointcb: callback invoked when a checkpoint is complete. Return True to take another checkpoint, or False to stop. 
flags: Remus flags to be passed to xc_domain_save + setupcb: callback invoked to configure network for colo """ self.fd = fd self.suspendcb = suspendcb @@ -149,6 +150,7 @@ class Saver(object): self.checkpointcb = checkpointcb self.interval = interval self.flags = flags + self.setupcb = setupcb self.vm = vm.VM(domid) @@ -166,8 +168,8 @@ class Saver(object): try: self.checkpointer.open(self.vm.domid) self.checkpointer.start(self.fd, self.suspendcb, self.resumecb, - self.checkpointcb, self.interval, - self.flags) + self.checkpointcb, self.setupcb, + self.interval, self.flags) except xen.lowlevel.checkpoint.error, e: raise CheckpointError(e) finally: diff --git a/tools/remus/remus b/tools/remus/remus index d5178cd..7be7fdd 100644 --- a/tools/remus/remus +++ b/tools/remus/remus @@ -164,6 +164,15 @@ def run(cfg): if closure.cmd == ''r2'': die() + def setup(): + ''setup network'' + if cfg.colo: + for vif in dom.vifs: + print "setup %s" % vif.dev + print util.runcmd([''/etc/xen/scripts/network-colo'', ''master'', ''install'', vif.dev, ''eth0'']) + return True + return False + def commit(): ''commit network buffer'' if closure.cmd == ''c'': @@ -199,7 +208,7 @@ def run(cfg): rc = 0 checkpointer = save.Saver(cfg.domid, fd, postsuspend, preresume, commit, - interval, cfg.flags) + setup, interval, cfg.flags) try: checkpointer.start() -- 1.7.4
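The setup() hook added to the remus script above shells out to the network-colo script once per guest vif. A pure-Python sketch of just the command construction, so it can be inspected without root (the script path and the "master install" arguments follow the patch; the helper name itself is made up):

```python
def colo_setup_cmds(vifs, outdev="eth0",
                    script="/etc/xen/scripts/network-colo"):
    """Build the per-vif command lines run by the colo setup callback:
    one 'master install' invocation for each guest interface, wiring
    that vif's traffic into the packet-comparison path."""
    return [[script, "master", "install", vif, outdev] for vif in vifs]
```

The real callback then runs each command (the patch uses util.runcmd) and returns True so that checkpointing continues.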
Wen Congyang
2013-Jul-11 08:35 UTC
[RFC Patch v2 15/16] xc_domain_save: implement save_callbacks for colo
Add a new save callback:

1. post_sendstate(): The SVM runs only after XC_SAVE_ID_LAST_CHECKPOINT has
   been sent to the slaver, but currently we only send
   XC_SAVE_ID_LAST_CHECKPOINT when doing live migration. This callback lets
   us send it in colo mode as well.

Update some callbacks for colo:

1. suspend(): In colo mode, both the PVM and the SVM are running, so we
   should suspend both of them. We communicate with the slaver like this:
   a. write "continue" to notify the slaver to suspend the SVM
   b. suspend the PVM and the SVM
   c. the slaver writes "suspend" to tell the master that the SVM is
      suspended

2. postcopy(): In colo mode, both the PVM and the SVM are running, and we
   have suspended both of them, so we should resume the PVM and the SVM.
   We communicate with the slaver like this:
   a. write "resume" to notify the slaver to resume the SVM
   b. resume the PVM and the SVM
   c. the slaver writes "resume" to tell the master that the SVM is resumed

3. checkpoint(): In colo mode, we do a new checkpoint only when the output
   packets from the PVM and the SVM differ. We block in this callback and
   return when an output packet differs.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_domain_save.c | 17 ++
 tools/libxc/xenguest.h | 3 +
 tools/python/xen/lowlevel/checkpoint/checkpoint.c | 302 ++++++++++++++++++++-
 tools/python/xen/lowlevel/checkpoint/checkpoint.h | 1 +
 4 files changed, 319 insertions(+), 4 deletions(-)

diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index b477188..8f84c9b 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1785,6 +1785,23 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
         }
     }
 
+    /* Flush last write and discard cache for file.
*/ + if ( outbuf_flush(xch, ob, io_fd) < 0 ) { + PERROR("Error when flushing output buffer"); + rc = 1; + } + + discard_file_cache(xch, io_fd, 1 /* flush */); + + if ( callbacks->post_sendstate ) + { + if ( callbacks->post_sendstate(callbacks->data) < 0) + { + PERROR("Error: post_sendstate()\n"); + goto out; + } + } + /* Zero terminate */ i = 0; if ( wrexact(io_fd, &i, sizeof(int)) ) diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h index 4bb444a..9d7d03c 100644 --- a/tools/libxc/xenguest.h +++ b/tools/libxc/xenguest.h @@ -72,6 +72,9 @@ struct save_callbacks { */ int (*toolstack_save)(uint32_t domid, uint8_t **buf, uint32_t *len, void *data); + /* called before Zero terminate is sent */ + int (*post_sendstate)(void *data); + /* to be provided as the last argument to each callback function */ void* data; }; diff --git a/tools/python/xen/lowlevel/checkpoint/checkpoint.c b/tools/python/xen/lowlevel/checkpoint/checkpoint.c index ec14b27..28bdb23 100644 --- a/tools/python/xen/lowlevel/checkpoint/checkpoint.c +++ b/tools/python/xen/lowlevel/checkpoint/checkpoint.c @@ -1,14 +1,22 @@ /* python bridge to checkpointing API */ #include <Python.h> +#include <sys/wait.h> #include <xenstore.h> #include <xenctrl.h> +#include <xc_private.h> +#include <xg_save_restore.h> #include "checkpoint.h" #define PKG "xen.lowlevel.checkpoint" +#define COMP_IOC_MAGIC ''k'' +#define COMP_IOCTWAIT _IO(COMP_IOC_MAGIC, 0) +#define COMP_IOCTFLUSH _IO(COMP_IOC_MAGIC, 1) +#define COMP_IOCTRESUME _IO(COMP_IOC_MAGIC, 2) + static PyObject* CheckpointError; typedef struct { @@ -25,11 +33,15 @@ typedef struct { PyObject* setup_cb; PyThreadState* threadstate; + int colo; + int first_time; + int dev_fd; } CheckpointObject; static int suspend_trampoline(void* data); static int postcopy_trampoline(void* data); static int checkpoint_trampoline(void* data); +static int post_sendstate_trampoline(void *data); static PyObject* Checkpoint_new(PyTypeObject* type, PyObject* args, PyObject* kwargs) @@ 
-169,10 +181,17 @@ static PyObject* pycheckpoint_start(PyObject* obj, PyObject* args) { } else self->setup_cb = NULL; + if (flags & CHECKPOINT_FLAGS_COLO) + self->colo = 1; + else + self->colo = 0; + self->first_time = 1; + memset(&callbacks, 0, sizeof(callbacks)); callbacks.suspend = suspend_trampoline; callbacks.postcopy = postcopy_trampoline; callbacks.checkpoint = checkpoint_trampoline; + callbacks.post_sendstate = post_sendstate_trampoline; callbacks.data = self; self->threadstate = PyEval_SaveThread(); @@ -279,6 +298,196 @@ PyMODINIT_FUNC initcheckpoint(void) { block_timer(); } +/* colo functions */ + +/* master slaver comment + * "continue" ===> + * <=== "suspend" guest is suspended + */ +static int notify_slaver_suspend(CheckpointObject *self) +{ + int fd = self->cps.fd; + + if (self->first_time == 1) + return 0; + + return write_exact(fd, "continue", 8); +} + +static int wait_slaver_suspend(CheckpointObject *self) +{ + int fd = self->cps.fd; + xc_interface *xch = self->cps.xch; + char buf[8]; + + if (self->first_time == 1) + return 0; + + if ( read_exact(fd, buf, 7) < 0) { + PERROR("read: suspend"); + return -1; + } + + buf[7] = ''\0''; + if (strcmp(buf, "suspend")) { + PERROR("read \"%s\", expect \"suspend\"", buf); + return -1; + } + + return 0; +} + +static int notify_slaver_start_checkpoint(CheckpointObject *self) +{ + int fd = self->cps.fd; + xc_interface *xch = self->cps.xch; + + if (self->first_time == 1) + return 0; + + if ( write_exact(fd, "start", 5) < 0) { + PERROR("write start"); + return -1; + } + + return 0; +} + +/* + * master slaver + * <==== "finish" + * flush packets + * "resume" ====> + * resume vm resume vm + * <==== "resume" + */ +static int notify_slaver_resume(CheckpointObject *self) +{ + int fd = self->cps.fd; + xc_interface *xch = self->cps.xch; + char buf[7]; + + /* wait slaver to finish update memory, device state... 
*/ + if ( read_exact(fd, buf, 6) < 0) { + PERROR("read: finish"); + return -1; + } + + buf[6] = ''\0''; + if (strcmp(buf, "finish")) { + ERROR("read \"%s\", expect \"finish\"", buf); + return -1; + } + + if (!self->first_time) + /* flush queued packets now */ + ioctl(self->dev_fd, COMP_IOCTFLUSH); + + /* notify slaver to resume vm*/ + if (write_exact(fd, "resume", 6) < 0) { + PERROR("write: resume"); + return -1; + } + + return 0; +} + +static int install_fw_network(CheckpointObject *self) +{ + int rc; + PyObject* result; + + PyEval_RestoreThread(self->threadstate); + result = PyObject_CallFunction(self->setup_cb, NULL); + self->threadstate = PyEval_SaveThread(); + + if (!result) + return -1; + + if (result == Py_None || PyObject_IsTrue(result)) + rc = 0; + else + rc = -1; + + Py_DECREF(result); + + return rc; +} + +static int wait_slaver_resume(CheckpointObject *self) +{ + int fd = self->cps.fd; + xc_interface *xch = self->cps.xch; + char buf[7]; + + if (read_exact(fd, buf, 6) < 0) { + PERROR("read resume"); + return -1; + } + + buf[6] = ''\0''; + if (strcmp(buf, "resume")) { + ERROR("read \"%s\", expect \"resume\"", buf); + return -1; + } + + return 0; +} + +static int colo_postresume(CheckpointObject *self) +{ + int rc; + int dev_fd = self->dev_fd; + + rc = wait_slaver_resume(self); + if (rc < 0) + return rc; + + if (self->first_time) { + rc = install_fw_network(self); + if (rc < 0) { + fprintf(stderr, "install network fails\n"); + return rc; + } + } else { + ioctl(dev_fd, COMP_IOCTRESUME); + } + + return 0; +} + +static int pre_checkpoint(CheckpointObject *self) +{ + xc_interface *xch = self->cps.xch; + + if (!self->first_time) + return 0; + + self->dev_fd = open("/dev/HA_compare", O_RDWR); + if (self->dev_fd < 0) { + PERROR("opening /dev/HA_compare fails"); + return -1; + } + + return 0; +} + +static void wait_new_checkpoint(CheckpointObject *self) +{ + int dev_fd = self->dev_fd; + int err; + + while (1) { + err = ioctl(dev_fd, COMP_IOCTWAIT); + if (err == 0) 
+ break; + + if (err == -1 && errno != ERESTART && errno != ETIME) { + fprintf(stderr, "ioctl() returns -1, errno: %d\n", errno); + } + } +} + /* private functions */ /* bounce C suspend call into python equivalent. @@ -289,6 +498,13 @@ static int suspend_trampoline(void* data) PyObject* result; + if (self->colo) { + if (notify_slaver_suspend(self) < 0) { + fprintf(stderr, "nofitying slaver suspend fails\n"); + return 0; + } + } + /* call default suspend function, then python hook if available */ if (self->armed) { if (checkpoint_wait(&self->cps) < 0) { @@ -307,8 +523,16 @@ static int suspend_trampoline(void* data) } } + /* suspend_cb() should be called after both sides are suspended */ + if (self->colo) { + if (wait_slaver_suspend(self) < 0) { + fprintf(stderr, "waiting slaver suspend fails\n"); + return 0; + } + } + if (!self->suspend_cb) - return 1; + goto start_checkpoint; PyEval_RestoreThread(self->threadstate); result = PyObject_CallFunction(self->suspend_cb, NULL); @@ -319,12 +543,32 @@ static int suspend_trampoline(void* data) if (result == Py_None || PyObject_IsTrue(result)) { Py_DECREF(result); - return 1; + goto start_checkpoint; } Py_DECREF(result); return 0; + +start_checkpoint: + if (self->colo) { + if (notify_slaver_start_checkpoint(self) < 0) { + fprintf(stderr, "nofitying slaver to start checkpoint fails\n"); + return 0; + } + + /* PVM is suspended first when doing live migration, + * and then it is suspended for a new checkpoint. 
+ */ + if (self->first_time == 1) + /* live migration */ + self->first_time = 2; + else if (self->first_time == 2) + /* the first checkpoint */ + self->first_time = 0; + } + + return 1; } static int postcopy_trampoline(void* data) @@ -334,6 +578,13 @@ static int postcopy_trampoline(void* data) PyObject* result; int rc = 0; + if (self->colo) { + if (notify_slaver_resume(self) < 0) { + fprintf(stderr, "nofitying slaver resume fails\n"); + return 0; + } + } + if (!self->postcopy_cb) goto resume; @@ -352,6 +603,13 @@ static int postcopy_trampoline(void* data) return 0; } + if (self->colo) { + if (colo_postresume(self) < 0) { + fprintf(stderr, "postresume fails\n"); + return 0; + } + } + return rc; } @@ -366,8 +624,15 @@ static int checkpoint_trampoline(void* data) return -1; } + if (self->colo) { + if (pre_checkpoint(self) < 0) { + fprintf(stderr, "pre_checkpoint() fails\n"); + return -1; + } + } + if (!self->checkpoint_cb) - return 0; + goto wait_checkpoint; PyEval_RestoreThread(self->threadstate); result = PyObject_CallFunction(self->checkpoint_cb, NULL); @@ -378,10 +643,39 @@ static int checkpoint_trampoline(void* data) if (result == Py_None || PyObject_IsTrue(result)) { Py_DECREF(result); - return 1; + goto wait_checkpoint; } Py_DECREF(result); return 0; + +wait_checkpoint: + if (self->colo) { + wait_new_checkpoint(self); + } + + fprintf(stderr, "\n\nnew checkpoint..........\n"); + + return 1; +} + +static int post_sendstate_trampoline(void* data) +{ + CheckpointObject *self = data; + int fd = self->cps.fd; + int i = XC_SAVE_ID_LAST_CHECKPOINT; + + if (!self->colo) + return 0; + + /* In colo mode, guest is running on slaver side, so we should + * send XC_SAVE_ID_LAST_CHECKPOINT to slaver. 
+ */ + if (write_exact(fd, &i, sizeof(int)) < 0) { + fprintf(stderr, "writing XC_SAVE_ID_LAST_CHECKPOINT fails\n"); + return -1; + } + + return 0; } diff --git a/tools/python/xen/lowlevel/checkpoint/checkpoint.h b/tools/python/xen/lowlevel/checkpoint/checkpoint.h index 187d9d7..96fc949 100644 --- a/tools/python/xen/lowlevel/checkpoint/checkpoint.h +++ b/tools/python/xen/lowlevel/checkpoint/checkpoint.h @@ -41,6 +41,7 @@ typedef struct { } checkpoint_state; #define CHECKPOINT_FLAGS_COMPRESSION 1 +#define CHECKPOINT_FLAGS_COLO 2 char* checkpoint_error(checkpoint_state* s); void checkpoint_init(checkpoint_state* s); -- 1.7.4
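Taken together, the helper functions above give each colo checkpoint a fixed message exchange over io_fd. A small sketch encoding that ordering, seen from the master (illustrative, not code from the tree):

```python
# One colo checkpoint round over io_fd, from the master's point of view,
# as implemented by notify_slaver_suspend(), wait_slaver_suspend(),
# notify_slaver_start_checkpoint(), notify_slaver_resume() and
# wait_slaver_resume().  Directions are relative to the master.
MASTER_ROUND = [
    ("send", "continue"),  # ask the slaver to suspend the SVM
    ("recv", "suspend"),   # slaver: SVM is suspended
    ("send", "start"),     # begin sending dirty state
    ("recv", "finish"),    # slaver: SVM state is consistent, ready
    ("send", "resume"),    # both sides resume their VM
    ("recv", "resume"),    # slaver: SVM is resumed
]

def check_round(transcript):
    """Return True iff a recorded (direction, message) transcript
    follows the protocol for exactly one checkpoint round."""
    return list(transcript) == MASTER_ROUND
```

Note that on the very first pass (first_time == 1, i.e. the initial live migration) the "continue"/"suspend" leg is skipped, since the SVM is not yet running.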
Add a new option --colo to the remus command. The options --time, -i and
--no-net are ignored when --colo is specified. In colo mode, we write the
new signature "GuestColoRestore". If the Xen tools on the secondary machine
do not support colo, they will reject this signature, and the remus command
will fail.

Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/python/xen/remus/image.py | 8 ++++++--
 tools/python/xen/remus/save.py | 7 +++++--
 tools/remus/remus | 20 +++++++++++++++++---
 3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/tools/python/xen/remus/image.py b/tools/python/xen/remus/image.py
index b79d1e5..6bae8f4 100644
--- a/tools/python/xen/remus/image.py
+++ b/tools/python/xen/remus/image.py
@@ -5,6 +5,7 @@ import logging, struct
 import vm
 
 SIGNATURE = 'LinuxGuestRecord'
+COLO_SIGNATURE = "GuestColoRestore"
 LONGLEN = struct.calcsize('L')
 INTLEN = struct.calcsize('i')
 PAGE_SIZE = 4096
@@ -189,9 +190,12 @@ def parseheader(header):
     "parses a header sexpression"
     return vm.parsedominfo(vm.strtosxpr(header))
 
-def makeheader(dominfo):
+def makeheader(dominfo, colo):
     "create an image header from a VM dominfo sxpr"
-    items = [SIGNATURE]
+    if colo:
+        items = [COLO_SIGNATURE]
+    else:
+        items = [SIGNATURE]
     sxpr = vm.sxprtostr(dominfo)
     items.append(struct.pack('!i', len(sxpr)))
     items.append(sxpr)
diff --git a/tools/python/xen/remus/save.py b/tools/python/xen/remus/save.py
index 81e05b9..45be172 100644
--- a/tools/python/xen/remus/save.py
+++ b/tools/python/xen/remus/save.py
@@ -133,7 +133,8 @@ class Keepalive(object):
 class Saver(object):
     def __init__(self, domid, fd, suspendcb=None, resumecb=None,
-                 checkpointcb=None, setupcb=None, interval=0, flags=0):
+                 checkpointcb=None, setupcb=None, interval=0, flags=0,
+                 colo=False):
         """Create a Saver object for taking guest checkpoints.
        domid: name, number or UUID of a running domain
        fd: a stream to which checkpoint data will be written.
@@ -143,6 +144,7 @@ class Saver(object):
              True to take another checkpoint, or False to stop.
            flags: Remus flags to be passed to xc_domain_save
            setupcb: callback invoked to configure network for colo
+           colo: use colo mode
         """
         self.fd = fd
         self.suspendcb = suspendcb
@@ -151,6 +153,7 @@ class Saver(object):
         self.interval = interval
         self.flags = flags
         self.setupcb = setupcb
+        self.colo = colo

         self.vm = vm.VM(domid)

@@ -159,7 +162,7 @@ class Saver(object):
     def start(self):
         vm.getshadowmem(self.vm)

-        hdr = image.makeheader(self.vm.dominfo)
+        hdr = image.makeheader(self.vm.dominfo, self.colo)
         self.fd.write(hdr)
         self.fd.flush()
diff --git a/tools/remus/remus b/tools/remus/remus
index 7be7fdd..592c8cc 100644
--- a/tools/remus/remus
+++ b/tools/remus/remus
@@ -18,6 +18,7 @@ class CfgException(Exception): pass

 class Cfg(object):
     REMUS_FLAGS_COMPRESSION = 1
+    REMUS_FLAGS_COLO = 2

     def __init__(self):
         # must be set
@@ -30,6 +31,7 @@ class Cfg(object):
         self.netbuffer = True
         self.flags = self.REMUS_FLAGS_COMPRESSION
         self.timer = False
+        self.colo = False

         parser = optparse.OptionParser()
         parser.usage = '%prog [options] domain [destination]'
@@ -46,6 +48,8 @@ class Cfg(object):
                          help='run without checkpoint compression')
         parser.add_option('', '--timer', dest='timer', action='store_true',
                          help='force pause at checkpoint interval (experimental)')
+        parser.add_option('', '--colo', dest='colo', action='store_true',
+                         help='use colo checkpointing (experimental)')
         self.parser = parser

     def usage(self):
@@ -66,6 +70,12 @@ class Cfg(object):
             self.flags &= ~self.REMUS_FLAGS_COMPRESSION
         if opts.timer:
             self.timer = True
+        if opts.colo:
+            self.interval = 0
+            self.netbuffer = False
+            self.timer = True
+            self.colo = True
+            self.flags |= self.REMUS_FLAGS_COLO

         if not args:
             raise CfgException('Missing domain')
@@ -123,8 +133,12 @@ def run(cfg):
     if not cfg.nullremus:
         for disk in dom.disks:
             try:
-                bufs.append(ReplicatedDisk(disk))
-                disk.init('r')
+                rdisk = ReplicatedDisk(disk)
+                bufs.append(rdisk)
+                if cfg.colo:
+                    rdisk.init('c')
+                else:
+                    rdisk.init('r')
             except ReplicatedDiskException, e:
                 print e
                 continue
@@ -208,7 +222,7 @@ def run(cfg):
     rc = 0

     checkpointer = save.Saver(cfg.domid, fd, postsuspend, preresume, commit,
-                              setup, interval, cfg.flags)
+                              setup, interval, cfg.flags, cfg.colo)

     try:
         checkpointer.start()
--
1.7.4
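The `--colo` option above folds into the existing Remus flags word that is handed down through `save.Saver` to `xc_domain_save`. A standalone sketch of that bitmask handling: `build_flags` is a hypothetical helper, only the flag names and values come from the patch.

```python
# Flag bits as defined in tools/remus/remus by this patch.
REMUS_FLAGS_COMPRESSION = 1
REMUS_FLAGS_COLO = 2

def build_flags(colo=False, compression=True):
    """Mimic Cfg: compression defaults on; options toggle individual bits."""
    flags = REMUS_FLAGS_COMPRESSION if compression else 0
    if colo:
        # --colo sets its own bit; the patch also forces interval=0,
        # netbuffer=False and timer=True alongside it.
        flags |= REMUS_FLAGS_COLO
    return flags

assert build_flags() == REMUS_FLAGS_COMPRESSION
assert build_flags(colo=True) == REMUS_FLAGS_COMPRESSION | REMUS_FLAGS_COLO
assert build_flags(colo=True, compression=False) == REMUS_FLAGS_COLO
```

Keeping the modes as independent bits is what lets the flags word travel unchanged through the Python layers into the C `xc_domain_save` call.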
Andrew Cooper
2013-Jul-11 09:37 UTC
Re: [RFC Patch v2 00/16] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
On 11/07/13 09:35, Wen Congyang wrote:
> Virtual machine (VM) replication is a well known technique for providing
> application-agnostic software-implemented hardware fault tolerance -
> "non-stop service". Currently, remus provides this function, but it buffers
> all output packets, and the latency is unacceptable.
>
> In xen summit 2012, We introduce a new VM replication solution: colo
> (COarse-grain LOck-stepping virtual machine). The presentation is in
> the following URL:
> http://www.slideshare.net/xen_com_mgr/colo-coarsegrain-lockstepping-virtual-machines-for-nonstop-service
>
> Here is the summary of the solution:
> From the client's point of view, as long as the client observes identical
> responses from the primary and secondary VMs, according to the service
> semantics, then the secondary VM(SVM) is a valid replica of the primary
> VM(PVM), and can successfully take over when a hardware failure of the
> PVM is detected.

How set in stone are you about the terms PVM and SVM?

SVM already has a specific meaning in Xen, being AMD Software Virtual
Machine extensions which allow for HVM guests. As a lesser problem, PVM
is sometimes used to mean PV, as a mirror of HVM.

~Andrew

> [...]
Ian Campbell
2013-Jul-11 09:40 UTC
Re: [RFC Patch v2 10/16] colo: implement restore_callbacks finish_restore()
On Thu, 2013-07-11 at 16:35 +0800, Wen Congyang wrote:
> [...]
> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
> index 70994b9..92d11af 100644
> --- a/tools/libxc/Makefile
> +++ b/tools/libxc/Makefile
> @@ -49,7 +49,7 @@ GUEST_SRCS-y += xc_nomigrate.c
>  endif
>
>  vpath %.c ../../xen/common/libelf
> -CFLAGS += -I../../xen/common/libelf
> +CFLAGS += -I../../xen/common/libelf -I../xenstore

We have avoided needing libxc to speak xenstore so far.

It looks like you only use xs_suspend_evtchn_port, in which case you
could just pass this from libxc's caller.

> [...]
>
> +    if (colo_data->first_time) {
> +        sleep(10);

This can't be right, can it?

Ian.
Ian Campbell
2013-Jul-11 09:40 UTC
Re: [RFC Patch v2 00/16] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
On Thu, 2013-07-11 at 16:35 +0800, Wen Congyang wrote:
> [...]
> XendCheckpoint: implement colo

xend has been deprecated for two releases now. I'm afraid any new
functionality of this magnitude is going to need to integrate with libxl
instead.

It's a shame that the libxl/xl support for Remus appears to have
stalled.

Ian.
Andrew Cooper
2013-Jul-11 09:44 UTC
Re: [RFC Patch v2 01/16] xen: introduce new hypercall to reset vcpu
On 11/07/13 09:35, Wen Congyang wrote:
> In colo mode, SVM is running, and it will create pagetable, use gdt...
> When we do a new checkpoint, we may need to rollback all this operations.
> This new hypercall will do this.
>
> Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  xen/arch/x86/domain.c       | 57 +++++++++++++++++++++++++++++++
>  xen/arch/x86/x86_64/entry.S |  4 +++
>  xen/include/public/xen.h    |  1 +
>  3 files changed, 62 insertions(+), 0 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 874742c..709f77f 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1930,6 +1930,63 @@ int domain_relinquish_resources(struct domain *d)
>      return 0;
>  }
>
> +int do_reset_vcpu_op(unsigned long domid)
> +{
> +    struct vcpu *v;
> +    struct domain *d;
> +    int ret;
> +
> +    if ( domid == DOMID_SELF )
> +        /* We can't destroy outself pagetables */

"We can't destroy our own pagetables"

> +        return -EINVAL;
> +
> +    if ( (d = rcu_lock_domain_by_id(domid)) == NULL )
> +        return -EINVAL;
> +
> +    BUG_ON(!cpumask_empty(d->domain_dirty_cpumask));

This looks bogus. What guarantee is there (other than the toolstack
issuing appropriate hypercalls in an appropriate order) that this is
actually true?

> +    domain_pause(d);
> +
> +    if ( d->arch.relmem == RELMEM_not_started )
> +    {
> +        for_each_vcpu ( d, v )
> +        {
> +            /* Drop the in-use references to page-table bases. */
> +            ret = vcpu_destroy_pagetables(v);
> +            if ( ret )
> +                return ret;
> +
> +            unmap_vcpu_info(v);
> +            v->is_initialised = 0;
> +        }
> +
> +        if ( !is_hvm_domain(d) )
> +        {
> +            for_each_vcpu ( d, v )
> +            {
> +                /*
> +                 * Relinquish GDT mappings. No need for explicit unmapping of the
> +                 * LDT as it automatically gets squashed with the guest mappings.
> +                 */
> +                destroy_gdt(v);
> +            }
> +
> +            if ( d->arch.pv_domain.pirq_eoi_map != NULL )
> +            {
> +                unmap_domain_page_global(d->arch.pv_domain.pirq_eoi_map);
> +                put_page_and_type(
> +                        mfn_to_page(d->arch.pv_domain.pirq_eoi_map_mfn));
> +                d->arch.pv_domain.pirq_eoi_map = NULL;
> +                d->arch.pv_domain.auto_unmask = 0;
> +            }
> +        }
> +    }
> +
> +    domain_unpause(d);
> +    rcu_unlock_domain(d);
> +
> +    return 0;
> +}
> +
>  void arch_dump_domain_info(struct domain *d)
>  {
>      paging_dump_domain_info(d);
> diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
> index 5beeccb..0e4dde4 100644
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -762,6 +762,8 @@ ENTRY(hypercall_table)
>          .quad do_domctl
>          .quad do_kexec_op
>          .quad do_tmem_op
> +        .quad do_ni_hypercall       /* reserved for XenClient */
> +        .quad do_reset_vcpu_op      /* 40 */
>          .rept __HYPERVISOR_arch_0-((.-hypercall_table)/8)
>          .quad do_ni_hypercall
>          .endr
> @@ -810,6 +812,8 @@ ENTRY(hypercall_args_table)
>          .byte 1 /* do_domctl            */
>          .byte 2 /* do_kexec             */
>          .byte 1 /* do_tmem_op           */
> +        .byte 0 /* do_ni_hypercall      */
> +        .byte 1 /* do_reset_vcpu_op */  /* 40 */
>          .rept __HYPERVISOR_arch_0-(.-hypercall_args_table)
>          .byte 0 /* do_ni_hypercall      */
>          .endr
> diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
> index 3cab74f..696f4a3 100644
> --- a/xen/include/public/xen.h
> +++ b/xen/include/public/xen.h
> @@ -101,6 +101,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_ulong_t);
>  #define __HYPERVISOR_kexec_op             37
>  #define __HYPERVISOR_tmem_op              38
>  #define __HYPERVISOR_xc_reserved_op       39 /* reserved for XenClient */
> +#define __HYPERVISOR_reset_vcpu_op        40

Why can this not be a domctl subop?

~Andrew

>
>  /* Architecture-specific hypercall definitions. */
>  #define __HYPERVISOR_arch_0               48
Wen Congyang
2013-Jul-11 09:54 UTC
Re: [RFC Patch v2 10/16] colo: implement restore_callbacks finish_restore()
At 07/11/2013 05:40 PM, Ian Campbell Wrote:
> On Thu, 2013-07-11 at 16:35 +0800, Wen Congyang wrote:
>> [...]
>> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
>> index 70994b9..92d11af 100644
>> --- a/tools/libxc/Makefile
>> +++ b/tools/libxc/Makefile
>> @@ -49,7 +49,7 @@ GUEST_SRCS-y += xc_nomigrate.c
>>  endif
>>
>>  vpath %.c ../../xen/common/libelf
>> -CFLAGS += -I../../xen/common/libelf
>> +CFLAGS += -I../../xen/common/libelf -I../xenstore
>
> We have avoided needing libxc to speak xenstore so far.

OK. I will fix it in the next version.

> It looks like you only use xs_suspend_evtchn_port, in which case you
> could just pass this from libxc's caller.
>
>> [...]
>
>> +    if (colo_data->first_time) {
>> +        sleep(10);
>
> This can't be right, can it?

Yes, it is just to wait for the suspend evtchn port. I will clean it up
in the next version.

Thanks

> Ian.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
Wen Congyang
2013-Jul-11 09:58 UTC
Re: [RFC Patch v2 01/16] xen: introduce new hypercall to reset vcpu
At 07/11/2013 05:44 PM, Andrew Cooper Wrote:
> On 11/07/13 09:35, Wen Congyang wrote:
>> In colo mode, SVM is running, and it will create pagetable, use gdt...
>> When we do a new checkpoint, we may need to rollback all this operations.
>> This new hypercall will do this.
>>
>> [...]
>>
>> +    if ( domid == DOMID_SELF )
>> +        /* We can't destroy outself pagetables */
>
> "We can't destroy our own pagetables"
>
>> +        return -EINVAL;
>> +
>> +    if ( (d = rcu_lock_domain_by_id(domid)) == NULL )
>> +        return -EINVAL;
>> +
>> +    BUG_ON(!cpumask_empty(d->domain_dirty_cpumask));
>
> This looks bogus. What guarantee is there (other than the toolstack
> issuing appropriate hypercalls in an appropriate order) that this is
> actually true?

Hmm, these codes are copied from this function:
domain_relinquish_resources()

>> [...]
>>
>> +#define __HYPERVISOR_reset_vcpu_op        40
>
> Why can this not be a domctl subop?

Hmm, I will do it.

Thanks
Wen Congyang

> ~Andrew
>
>>  /* Architecture-specific hypercall definitions. */
>>  #define __HYPERVISOR_arch_0               48
Ian Campbell
2013-Jul-11 10:01 UTC
Re: [RFC Patch v2 01/16] xen: introduce new hypercall to reset vcpu
On Thu, 2013-07-11 at 17:58 +0800, Wen Congyang wrote:
>>> +    BUG_ON(!cpumask_empty(d->domain_dirty_cpumask));
>>
>> This looks bogus. What guarantee is there (other than the toolstack
>> issuing appropriate hypercalls in an appropriate order) that this is
>> actually true.
>
> Hmm, these codes are copied from this function:
> domain_relinquish_resources()

That's called under very different circumstances though. Specifically
during domain teardown when the vcpus are necessarily all quiescent.
Andrew Cooper
2013-Jul-11 13:43 UTC
Re: [RFC Patch v2 13/16] xc_domain_save: flush cache before calling callbacks->postcopy()
On 11/07/13 09:35, Wen Congyang wrote:
> callbacks->postcopy() may use the fd to transfer something to the
> other end, so we should flush cache before calling callbacks->postcopy()
>
> Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---

This looks like it is a bugfix on its own, so it might perhaps be better
submitted as an individual fix, rather than being mixed in with a huge
series of new functionality.

~Andrew

>  tools/libxc/xc_domain_save.c | 6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
> index fbc15e9..b477188 100644
> --- a/tools/libxc/xc_domain_save.c
> +++ b/tools/libxc/xc_domain_save.c
> @@ -2034,9 +2034,6 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
>   out:
>      completed = 1;
>
> -    if ( !rc && callbacks->postcopy )
> -        callbacks->postcopy(callbacks->data);
> -
>      /* guest has been resumed. Now we can compress data
>       * at our own pace.
>       */
> @@ -2066,6 +2063,9 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
>
>      discard_file_cache(xch, io_fd, 1 /* flush */);
>
> +    if ( !rc && callbacks->postcopy )
> +        callbacks->postcopy(callbacks->data);
> +
>      /* Enable compression now, finally */
>      compressing = (flags & XCFLAGS_CHECKPOINT_COMPRESS);
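The ordering problem this patch fixes can be reproduced outside Xen: if a callback writes to the raw fd while earlier data is still sitting in a userspace buffer, the callback's bytes overtake it on the wire. A minimal Python sketch of the same effect (hypothetical names; `io.BufferedWriter` stands in for libxc's output buffer, `os.write` for a postcopy callback writing on the raw fd):

```python
import io
import os

def send_checkpoint(flush_first):
    """Write buffered 'page data', then let a callback write on the raw fd."""
    r, w = os.pipe()
    stream = io.BufferedWriter(io.FileIO(w, 'w'))
    stream.write(b'<checkpoint-pages>')   # held in the userspace buffer
    if flush_first:
        stream.flush()                    # the fix: drain the buffer first
    os.write(w, b'<postcopy-msg>')        # callback bypasses the buffer
    stream.flush()
    stream.close()                        # also closes w
    data = os.read(r, 256)
    os.close(r)
    return data

# Without the flush, the callback's message jumps ahead of the page data.
assert send_checkpoint(False) == b'<postcopy-msg><checkpoint-pages>'
# With the flush, the stream arrives in the intended order.
assert send_checkpoint(True) == b'<checkpoint-pages><postcopy-msg>'
```

This mirrors why the patch moves the `postcopy()` call to after `outbuf_flush()`/`discard_file_cache()`: once the callback shares the fd, stream order is only preserved if the buffer is drained first.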
Andrew Cooper
2013-Jul-11 13:52 UTC
Re: [RFC Patch v2 15/16] xc_domain_save: implement save_callbacks for colo
On 11/07/13 09:35, Wen Congyang wrote:
> Add a new save callbacks:
> 1. post_sendstate(): SVM will run only when XC_SAVE_ID_LAST_CHECKPOINT is
>    sent to slaver. But we only sent XC_SAVE_ID_LAST_CHECKPOINT when we do
>    live migration now. Add this callback, and we can send it in this
>    callback.
>
> Update some callbacks for colo:
> 1. suspend(): In colo mode, both PVM and SVM are running. So we should suspend
>    both PVM and SVM.
>    Communicate with slaver like this:
>    a. write "continue" to notify slaver to suspend SVM
>    b. suspend PVM and SVM
>    c. slaver writes "suspend" to tell master that SVM is suspended
> 2. postcopy(): In colo mode, both PVM and SVM are running, and we have suspended
>    both PVM and SVM. So we should resume PVM and SVM
>    Communicate with slaver like this:
>    a. write "resume" to notify slaver to resume SVM
>    b. resume PVM and SVM
>    c. slaver writes "resume" to tell master that SVM is resumed
> 3. checkpoint(): In colo mode, we do a new checkpoint only when output packet
>    from PVM and SVM is different. We will block in this callback and return
>    when a output packet is different.
>
> Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  tools/libxc/xc_domain_save.c                      |  17 ++
>  tools/libxc/xenguest.h                            |   3 +
>  tools/python/xen/lowlevel/checkpoint/checkpoint.c | 302 ++++++++++++++++++++-
>  tools/python/xen/lowlevel/checkpoint/checkpoint.h |   1 +
>  4 files changed, 319 insertions(+), 4 deletions(-)
>
> diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
> index b477188..8f84c9b 100644
> --- a/tools/libxc/xc_domain_save.c
> +++ b/tools/libxc/xc_domain_save.c
> @@ -1785,6 +1785,23 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
>          }
>      }
>
> +    /* Flush last write and discard cache for file. */
> +    if ( outbuf_flush(xch, ob, io_fd) < 0 ) {
> +        PERROR("Error when flushing output buffer");
> +        rc = 1;
> +    }
> +
> +    discard_file_cache(xch, io_fd, 1 /* flush */);
> +
> +    if ( callbacks->post_sendstate )
> +    {
> +        if ( callbacks->post_sendstate(callbacks->data) < 0)
> +        {
> +            PERROR("Error: post_sendstate()\n");
> +            goto out;
> +        }
> +    }
> +
>      /* Zero terminate */
>      i = 0;
>      if ( wrexact(io_fd, &i, sizeof(int)) )
> diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
> index 4bb444a..9d7d03c 100644
> --- a/tools/libxc/xenguest.h
> +++ b/tools/libxc/xenguest.h
> @@ -72,6 +72,9 @@ struct save_callbacks {
>       */
>      int (*toolstack_save)(uint32_t domid, uint8_t **buf, uint32_t *len, void *data);
>
> +    /* called before Zero terminate is sent */
> +    int (*post_sendstate)(void *data);
> +
>      /* to be provided as the last argument to each callback function */
>      void* data;
>  };
> diff --git a/tools/python/xen/lowlevel/checkpoint/checkpoint.c b/tools/python/xen/lowlevel/checkpoint/checkpoint.c
> index ec14b27..28bdb23 100644
> --- a/tools/python/xen/lowlevel/checkpoint/checkpoint.c
> +++ b/tools/python/xen/lowlevel/checkpoint/checkpoint.c
> @@ -1,14 +1,22 @@
>  /* python bridge to checkpointing API */
>
>  #include <Python.h>
> +#include <sys/wait.h>

I can't see anything using this header file, which is good, as otherwise
I would still tell you that a python module should not be using any of
its contents.

~Andrew

> [...]
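The master/slave suspend and resume handshake described in the commit message above can be sketched end to end in a few lines. This is a simulation over a socketpair, not the patch's code: fixed-length messages stand in for `write_exact()`/`read_exact()`, and the checkpoint transfer, the `first_time` special cases, and the `/dev/HA_compare` ioctls are all omitted.

```python
# Simulated colo handshake: "continue"/"suspend" around the checkpoint,
# "finish"/"resume"/"resume" around resumption.
import socket
import threading

def read_exact(sock, n):
    """Read exactly n bytes, like libxc's read_exact()."""
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        assert chunk, 'peer closed connection'
        buf += chunk
    return buf

def master(sock, log):
    sock.sendall(b'continue')                 # ask slave to suspend SVM
    assert read_exact(sock, 7) == b'suspend'  # SVM is suspended; PVM too
    log.append('PVM suspended, checkpoint sent')
    assert read_exact(sock, 6) == b'finish'   # slave applied the checkpoint
    log.append('flush buffered packets')
    sock.sendall(b'resume')                   # both sides resume their VM
    assert read_exact(sock, 6) == b'resume'   # slave confirms SVM running

def slave(sock, log):
    assert read_exact(sock, 8) == b'continue'
    log.append('SVM suspended')
    sock.sendall(b'suspend')
    sock.sendall(b'finish')                   # after applying the checkpoint
    assert read_exact(sock, 6) == b'resume'
    log.append('SVM resumed')
    sock.sendall(b'resume')

a, b = socket.socketpair()
log = []
t = threading.Thread(target=slave, args=(b, log))
t.start()
master(a, log)
t.join()
a.close(); b.close()
assert log[-1] == 'SVM resumed'
```

Because every message has a fixed length, `read_exact()` doubles as the framing layer; this is the same reason the patch checks the received string against the expected one after each exact-length read.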
Wen Congyang
2013-Jul-12 01:36 UTC
Re: [RFC Patch v2 13/16] xc_domain_save: flush cache before calling callbacks->postcopy()
At 07/11/2013 09:43 PM, Andrew Cooper Wrote:
> On 11/07/13 09:35, Wen Congyang wrote:
>> callbacks->postcopy() may use the fd to transfer something to the
>> other end, so we should flush cache before calling callbacks->postcopy()
>>
>> Signed-off-by: Ye Wei <wei.ye1987@gmail.com>
>> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> ---
>
> This looks like it is a bugfix on its own, so perhaps might be better
> submitted as an individual fix, rather than being mixed in with a huge
> series of new functionality

Currently, callbacks->postcopy() does not use this fd to send anything to
the other end, so remus still works. In colo mode we will use this fd, so
I fix it here.

Thanks
Wen Congyang

> ~Andrew
>
>> [...]
Shriram Rajagopalan
2013-Jul-14 14:33 UTC
Re: [RFC Patch v2 00/16] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
On Thursday, July 11, 2013, Ian Campbell wrote:
> On Thu, 2013-07-11 at 16:35 +0800, Wen Congyang wrote:
>> [...]
>> XendCheckpoint: implement colo
>
> xend has been deprecated for two releases now. I'm afraid any new
> functionality of this magnitude is going to need to integrate with libxl
> instead.
>
> It's a shame that the libxl/xl support for Remus appears to have
> stalled.

Sorry, I have been AWOL for a while. Saw that xl supports drbd now. Will
try to push disk checkpoint support shortly, followed by network
buffering.

> Ian.
Tim Deegan
2013-Aug-01 11:48 UTC
Re: [RFC Patch v2 01/16] xen: introduce new hypercall to reset vcpu
Hi,

At 16:35 +0800 on 11 Jul (1373560533), Wen Congyang wrote:
> In colo mode, SVM is running, and it will create pagetable, use gdt...
> When we do a new checkpoint, we may need to rollback all this operations.
> This new hypercall will do this.

Can you do what you need with XEN_DOMCTL_setvcpucontext(domid, vcpuid,
NULL)? If not, maybe some small extensions to that call would be enough?

I think if we do need an entirely new hypercall, it should be part of
the DOMCTL call along with the other vcpu operations, rather than having
a new top-level hypercall of its own.

Cheers,

Tim.
Wen Congyang
2013-Aug-06 06:47 UTC
Re: [RFC Patch v2 01/16] xen: introduce new hypercall to reset vcpu
At 08/01/2013 07:48 PM, Tim Deegan Wrote:
> Hi,
>
> At 16:35 +0800 on 11 Jul (1373560533), Wen Congyang wrote:
>> In colo mode, SVM is running, and it will create pagetable, use gdt...
>> When we do a new checkpoint, we may need to rollback all this operations.
>> This new hypercall will do this.
>
> Can you do what you need with XEN_DOMCTL_setvcpucontext(domid, vcpuid, NULL)?
> If not, maybe some small extensions to that call would be enough?

I will try it.

> I think if we do need an entirely new hypercall, it should be part of
> the DOMCTL call along with the other vcpu operations, rather than having
> a new top-level hypercall of its own.

If setvcpucontext() can't work, I will add a new subop to DOMCTL instead
of a new hypercall.

Thanks
Wen Congyang

> Cheers,
>
> Tim.