This patch series introduces a basic framework to incorporate Remus into the libxl toolstack. The only functionality currently implemented is memory checkpointing. These patches depend on "libxl: refactor suspend/resume code" patch series. Changes since previous version: * Receiving end of remus is now part of migrate-receive * Re-use code already present for live migration (refactored in the above mentioned patch series) Shriram
rshriram@cs.ubc.ca
2012-Jan-31 01:22 UTC
[PATCH 1 of 2] libxl: Remus - suspend/postflush/commit callbacks
# HG changeset patch # User Shriram Rajagopalan <rshriram@cs.ubc.ca> # Date 1327972786 28800 # Node ID 0129989da2cda789f86f2bc83d9c642a0bcebe60 # Parent ffc99e708e90eb140b0a6f2e7087d567e71e9d0f libxl: Remus - suspend/postflush/commit callbacks * Add libxl callback functions for Remus checkpoint suspend, postflush (aka resume) and checkpoint commit callbacks. * suspend callback is a stub that just bounces off libxl__domain_suspend_common_callback - which suspends the domain and saves the devices model state to a file. * resume callback currently just resumes the domain (and the device model). * commit callback just writes out the saved device model state to the network and sleeps for the checkpoint interval. * Future patches will augument these callbacks with more functionalities like issuing network buffer plug/unplug commands, disk checkpoint commands, etc. Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> diff -r ffc99e708e90 -r 0129989da2cd tools/libxl/libxl.c --- a/tools/libxl/libxl.c Mon Jan 30 16:59:01 2012 -0800 +++ b/tools/libxl/libxl.c Mon Jan 30 17:19:46 2012 -0800 @@ -480,7 +480,9 @@ int libxl_domain_suspend(libxl_ctx *ctx, int debug = info != NULL && info->flags & XL_SUSPEND_DEBUG; int rc = 0; - rc = libxl__domain_suspend_common(gc, domid, fd, type, live, debug); + rc = libxl__domain_suspend_common(gc, domid, fd, type, live, debug, + /* No Remus */ NULL); + if (!rc && type == LIBXL_DOMAIN_TYPE_HVM) rc = libxl__domain_save_device_model(gc, domid, fd); GC_FREE; diff -r ffc99e708e90 -r 0129989da2cd tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c Mon Jan 30 16:59:01 2012 -0800 +++ b/tools/libxl/libxl_dom.c Mon Jan 30 17:19:46 2012 -0800 @@ -404,6 +404,8 @@ struct suspendinfo { int hvm; unsigned int flags; int guest_responded; + int save_fd; /* Migration stream fd (for Remus) */ + int interval; /* checkpoint interval (for Remus) */ }; static int libxl__domain_suspend_common_switch_qemu_logdirty(int domid, unsigned int enable, void *data) @@ -609,9 +611,43 @@ static int libxl__domain_suspend_common_ return 1; } +static int libxl__remus_domain_suspend_callback(void *data) +{ + /* TODO: Issue disk and network checkpoint reqs. */ + return libxl__domain_suspend_common_callback(data); +} + +static int libxl__remus_domain_resume_callback(void *data) +{ + struct suspendinfo *si = data; + libxl_ctx *ctx = libxl__gc_owner(si->gc); + + /* Resumes the domain and the device model */ + if (libxl_domain_resume(ctx, si->domid, /* Fast Suspend */1)) + return 0; + + /* TODO: Deal with disk. Start a new network output buffer */ + return 1; +} + +static int libxl__remus_domain_checkpoint_callback(void *data) +{ + struct suspendinfo *si = data; + + /* This would go into tailbuf. */ + if (si->hvm && + libxl__domain_save_device_model(si->gc, si->domid, si->save_fd)) + return 0; + + /* TODO: Wait for disk and memory ack, release network buffer */ + usleep(si->interval * 1000); + return 1; +} + int libxl__domain_suspend_common(libxl__gc *gc, uint32_t domid, int fd, libxl_domain_type type, - int live, int debug) + int live, int debug, + const libxl_domain_remus_info *r_info) { libxl_ctx *ctx = libxl__gc_owner(gc); int flags; @@ -642,10 +678,20 @@ int libxl__domain_suspend_common(libxl__ return ERROR_INVAL; } + memset(&si, 0, sizeof(si)); flags = (live) ? XCFLAGS_LIVE : 0 | (debug) ? XCFLAGS_DEBUG : 0 | (hvm) ? XCFLAGS_HVM : 0; + if (r_info != NULL) { + si.interval = r_info->interval; + if (r_info->compression) + flags |= XCFLAGS_CHECKPOINT_COMPRESS; + si.save_fd = fd; + } + else + si.save_fd = -1; + si.domid = domid; si.flags = flags; si.hvm = hvm; @@ -669,7 +715,27 @@ int libxl__domain_suspend_common(libxl__ } memset(&callbacks, 0, sizeof(callbacks)); - callbacks.suspend = libxl__domain_suspend_common_callback; + if (r_info != NULL) { + /* save_callbacks: + * suspend - called after expiration of checkpoint interval, + * to *suspend* the domain. + * + * postcopy - called after the domain''s dirty pages have been + * copied into an output buffer. We *resume* the domain + * & the device model, return to the caller. Caller then + * flushes the output buffer, while the domain continues to run. + * + * checkpoint - called after the memory checkpoint has been flushed out + * into the network. Send the saved device state, *wait* + * for checkpoint ack and *release* the network buffer (TBD). + * Then *sleep* for the checkpoint interval. + */ + callbacks.suspend = libxl__remus_domain_suspend_callback; + callbacks.postcopy = libxl__remus_domain_resume_callback; + callbacks.checkpoint = libxl__remus_domain_checkpoint_callback; + } else + callbacks.suspend = libxl__domain_suspend_common_callback; + callbacks.switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty; callbacks.data = &si; diff -r ffc99e708e90 -r 0129989da2cd tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h Mon Jan 30 16:59:01 2012 -0800 +++ b/tools/libxl/libxl_internal.h Mon Jan 30 17:19:46 2012 -0800 @@ -275,7 +275,8 @@ _hidden int libxl__domain_restore_common int fd); _hidden int libxl__domain_suspend_common(libxl__gc *gc, uint32_t domid, int fd, libxl_domain_type type, - int live, int debug); + int live, int debug, + const libxl_domain_remus_info *r_info); _hidden const char *libxl__device_model_savefile(libxl__gc *gc, uint32_t domid); _hidden int libxl__domain_suspend_device_model(libxl__gc *gc, uint32_t domid); _hidden int libxl__domain_resume_device_model(libxl__gc *gc, uint32_t domid); diff -r ffc99e708e90 -r 0129989da2cd tools/libxl/libxl_types.idl --- a/tools/libxl/libxl_types.idl Mon Jan 30 16:59:01 2012 -0800 +++ b/tools/libxl/libxl_types.idl Mon Jan 30 17:19:46 2012 -0800 @@ -397,3 +397,9 @@ libxl_sched_sedf = Struct("sched_sedf", ("extratime", integer), ("weight", integer), ], dispose_fn=None) + +libxl_domain_remus_info = Struct("domain_remus_info",[ + ("interval", integer), + ("blackhole", bool), + ("compression", bool), + ])
# HG changeset patch # User Shriram Rajagopalan <rshriram@cs.ubc.ca> # Date 1327972788 28800 # Node ID 070f78ba4244cf8239ded3a69ccd54ac6e1bd24d # Parent 0129989da2cda789f86f2bc83d9c642a0bcebe60 libxl: Remus - xl remus command xl remus acts as a frontend to enable remus for a given domain. * At the moment, only memory checkpointing and blackhole replication is supported. Support for disk checkpointing and network buffering will be added in future. * Replication is done over ssh connection currently (like live migration with xl). Future versions will have an option to use simple tcp socket based replication channel (for both Remus & live migration). Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> diff -r 0129989da2cd -r 070f78ba4244 tools/libxl/libxl.c --- a/tools/libxl/libxl.c Mon Jan 30 17:19:46 2012 -0800 +++ b/tools/libxl/libxl.c Mon Jan 30 17:19:48 2012 -0800 @@ -471,6 +471,40 @@ libxl_vminfo * libxl_list_vm(libxl_ctx * return ptr; } +int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info, + uint32_t domid, int fd) +{ + GC_INIT(ctx); + libxl_domain_type type = libxl__domain_type(gc, domid); + int rc = 0; + + if (info == NULL) { + LIBXL__LOG(ctx, LIBXL__LOG_ERROR, + "No remus_info structure supplied for domain %d", domid); + rc = -1; + goto remus_fail; + } + + /* TBD: Remus setup - i.e. attach qdisc, enable disk buffering, etc */ + + /* Point of no return */ + rc = libxl__domain_suspend_common(gc, domid, fd, type, /* live */ 1, + /* debug */ 0, info); + + /* + * With Remus, if we reach this point, it means either + * backup died or some network error occurred preventing us + * from sending checkpoints. + */ + + /* TBD: Remus cleanup - i.e. detach qdisc, release other + * resources. + */ + remus_fail: + GC_FREE; + return rc; +} + int libxl_domain_suspend(libxl_ctx *ctx, libxl_domain_suspend_info *info, uint32_t domid, int fd) { diff -r 0129989da2cd -r 070f78ba4244 tools/libxl/libxl.h --- a/tools/libxl/libxl.h Mon Jan 30 17:19:46 2012 -0800 +++ b/tools/libxl/libxl.h Mon Jan 30 17:19:48 2012 -0800 @@ -266,6 +266,8 @@ typedef int (*libxl_console_ready)(libxl int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config, libxl_console_ready cb, void *priv, uint32_t *domid); int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config, libxl_console_ready cb, void *priv, uint32_t *domid, int restore_fd); void libxl_domain_config_dispose(libxl_domain_config *d_config); +int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info, + uint32_t domid, int fd); int libxl_domain_suspend(libxl_ctx *ctx, libxl_domain_suspend_info *info, uint32_t domid, int fd); int libxl_domain_resume(libxl_ctx *ctx, uint32_t domid, int fast); diff -r 0129989da2cd -r 070f78ba4244 tools/libxl/xl.h --- a/tools/libxl/xl.h Mon Jan 30 17:19:46 2012 -0800 +++ b/tools/libxl/xl.h Mon Jan 30 17:19:48 2012 -0800 @@ -94,6 +94,7 @@ int main_cpupoolnumasplit(int argc, char int main_getenforce(int argc, char **argv); int main_setenforce(int argc, char **argv); int main_loadpolicy(int argc, char **argv); +int main_remus(int argc, char **argv); void help(const char *command); diff -r 0129989da2cd -r 070f78ba4244 tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c Mon Jan 30 17:19:46 2012 -0800 +++ b/tools/libxl/xl_cmdimpl.c Mon Jan 30 17:19:48 2012 -0800 @@ -2846,7 +2846,7 @@ static void core_dump_domain(const char * an ssh channel. */ static void migrate_receive(int debug, int daemonize, int monitor, - int send_fd, int recv_fd) + int send_fd, int recv_fd, int remus) { int rc, rc2; char rc_buf; @@ -2881,6 +2881,41 @@ static void migrate_receive(int debug, i exit(-rc); } + if (remus) { + /* If we are here, it means that the sender (primary) has crashed. + * TODO: Split-Brain Check. + */ + fprintf(stderr, "migration target: Remus Failover for domain %u\n", + domid); + + /* + * If domain renaming fails, lets just continue (as we need the domain + * to be up & dom names may not matter much, as long as its reachable + * over network). + * + * If domain unpausing fails, destroy domain ? Or is it better to have + * a consistent copy of the domain (memory, cpu state, disk) + * on atleast one physical host ? Right now, lets just leave the domain + * as is and let the Administrator decide (or troubleshoot). + */ + if (migration_domname) { + rc = libxl_domain_rename(ctx, domid, migration_domname, + common_domname); + if (rc) + fprintf(stderr, "migration target (Remus): " + "Failed to rename domain from %s to %s:%d\n", + migration_domname, common_domname, rc); + } + + rc = libxl_domain_unpause(ctx, domid); + if (rc) + fprintf(stderr, "migration target (Remus): " + "Failed to unpause domain %s (id: %u):%d\n", + common_domname, domid, rc); + + exit(rc ? -ERROR_FAIL: 0); + } + fprintf(stderr, "migration target: Transfer complete," " requesting permission to start domain.\n"); @@ -3007,10 +3042,10 @@ int main_restore(int argc, char **argv) int main_migrate_receive(int argc, char **argv) { - int debug = 0, daemonize = 1, monitor = 1; + int debug = 0, daemonize = 1, monitor = 1, remus = 0; int opt; - while ((opt = def_getopt(argc, argv, "Fed", "migrate-receive", 0)) != -1) { + while ((opt = def_getopt(argc, argv, "Fedr", "migrate-receive", 0)) != -1) { switch (opt) { case 0: case 2: return opt; @@ -3024,6 +3059,9 @@ int main_migrate_receive(int argc, char case ''d'': debug = 1; break; + case ''r'': + remus = 1; + break; } } @@ -3032,7 +3070,8 @@ int main_migrate_receive(int argc, char return 2; } migrate_receive(debug, daemonize, monitor, - /* write to stdout */1, /* read from stdin */0); + /* write to stdout */ 1, /* read from stdin */ 0, + remus); return 0; } @@ -5968,6 +6007,91 @@ done: return ret; } +int main_remus(int argc, char **argv) +{ + int opt, rc, daemonize = 1; + const char *ssh_command = "ssh"; + char *host = NULL, *rune = NULL, *domain = NULL; + libxl_domain_remus_info r_info; + int send_fd = -1, recv_fd = -1; + + memset(&r_info, 0, sizeof(libxl_domain_remus_info)); + /* Defaults */ + r_info.interval = 200; + r_info.blackhole = 0; + r_info.compression = 1; + + while ((opt = def_getopt(argc, argv, "bui:s:e", "remus", 2)) != -1) { + switch (opt) { + case 0: case 2: + return opt; + + case ''i'': + r_info.interval = atoi(optarg); + break; + case ''b'': + r_info.blackhole = 1; + break; + case ''u'': + r_info.compression = 0; + break; + case ''s'': + ssh_command = optarg; + break; + case ''e'': + daemonize = 0; + break; + } + } + + domain = argv[optind]; + host = argv[optind + 1]; + + if (r_info.blackhole) { + find_domain(domain); + send_fd = open("/dev/null", O_RDWR, 0644); + if (send_fd < 0) { + perror("failed to open /dev/null"); + exit(-1); + } + } else { + + /* + * TODO: change this to plain TCP socket based channel + * instead of SSH. For both live-migration and Remus. + */ + if (!ssh_command[0]) { + rune = host; + } else { + if (asprintf(&rune, "exec %s %s xl migrate-receive -r %s", + ssh_command, host, + daemonize ? "" : " -e") < 0) + return 1; + } + migrate_do_preamble(domain, rune, NULL, &send_fd, &recv_fd, NULL); + } + + /* Point of no return */ + rc = libxl_domain_remus_start(ctx, &r_info, domid, send_fd); + + /* If we are here, it means backup has failed/domain suspend failed. + * Try to resume the domain and exit gracefully. + * TODO: Split-Brain check. + */ + fprintf(stderr, "remus sender: libxl_domain_suspend failed" + " (rc=%d)\n", rc); + + if (rc == ERROR_GUEST_TIMEDOUT) + fprintf(stderr, "Failed to suspend domain at primary.\n"); + else { + fprintf(stderr, "Remus: Backup failed? resuming domain at primary.\n"); + libxl_domain_resume(ctx, domid, 1); + } + + close(send_fd); + return -ERROR_FAIL; +} + /* * Local variables: * mode: C diff -r 0129989da2cd -r 070f78ba4244 tools/libxl/xl_cmdtable.c --- a/tools/libxl/xl_cmdtable.c Mon Jan 30 17:19:46 2012 -0800 +++ b/tools/libxl/xl_cmdtable.c Mon Jan 30 17:19:48 2012 -0800 @@ -412,6 +412,20 @@ struct cmd_spec cmd_table[] = { "Loads a new policy int the Flask Xen security module", "<policy file>", }, + { "remus", + &main_remus, 0, + "Enable Remus HA for domain", + "[options] <Domain> [<host>]", + "-i MS Checkpoint domain memory every MS milliseconds (def. 200ms).\n" + "-b Replicate memory checkpoints to /dev/null (blackhole)\n" + "-u Disable memory checkpoint compression.\n" + "-s <sshcommand> Use <sshcommand> instead of ssh. String will be passed\n" + " to sh. If empty, run <host> instead of \n" + " ssh <host> xl migrate-receive -r\n" + "-e Do not wait in the background (on <host>) for the death\n" + " of the domain." + + }, }; int cmdtable_len = sizeof(cmd_table)/sizeof(struct cmd_spec);
Ian Campbell
2012-Jan-31 10:53 UTC
Re: [PATCH 1 of 2] libxl: Remus - suspend/postflush/commit callbacks
> @@ -669,7 +715,27 @@ int libxl__domain_suspend_common(libxl__ > } > > memset(&callbacks, 0, sizeof(callbacks)); > - callbacks.suspend = libxl__domain_suspend_common_callback; > + if (r_info != NULL) { > + /* save_callbacks: > + * suspend - called after expiration of checkpoint interval, > + * to *suspend* the domain. > + * > + * postcopy - called after the domain''s dirty pages have been > + * copied into an output buffer. We *resume* the domain > + * & the device model, return to the caller. Caller then > + * flushes the output buffer, while the domain continues to run. > + * > + * checkpoint - called after the memory checkpoint has been flushed out > + * into the network. Send the saved device state, *wait* > + * for checkpoint ack and *release* the network buffer (TBD). > + * Then *sleep* for the checkpoint interval. > + */Please can you put this next to the definition of the struct (in xenguest.h). Otherwise this patch looks ok to me. Ian.
On Tue, 2012-01-31 at 01:22 +0000, rshriram@cs.ubc.ca wrote:> # HG changeset patch > # User Shriram Rajagopalan <rshriram@cs.ubc.ca> > # Date 1327972788 28800 > # Node ID 070f78ba4244cf8239ded3a69ccd54ac6e1bd24d > # Parent 0129989da2cda789f86f2bc83d9c642a0bcebe60 > libxl: Remus - xl remus command > > xl remus acts as a frontend to enable remus for a given domain. > * At the moment, only memory checkpointing and blackhole replication is > supported. Support for disk checkpointing and network buffering will > be added in future. > * Replication is done over ssh connection currently (like live migration > with xl). Future versions will have an option to use simple tcp socket > based replication channel (for both Remus & live migration). > > Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> > > diff -r 0129989da2cd -r 070f78ba4244 tools/libxl/libxl.c > --- a/tools/libxl/libxl.c Mon Jan 30 17:19:46 2012 -0800 > +++ b/tools/libxl/libxl.c Mon Jan 30 17:19:48 2012 -0800 > @@ -471,6 +471,40 @@ libxl_vminfo * libxl_list_vm(libxl_ctx * > return ptr; > } > > +int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info, > + uint32_t domid, int fd)This new function probably belongs in patch 1/2 with the other libxl changes.> +{ > + GC_INIT(ctx); > + libxl_domain_type type = libxl__domain_type(gc, domid); > + int rc = 0; > + > + if (info == NULL) { > + LIBXL__LOG(ctx, LIBXL__LOG_ERROR, > + "No remus_info structure supplied for domain %d", domid); > + rc = -1;libxl functions should return one of ERROR_* not -1 one. In this case ERROR_INVAL I suppose. [...]> diff -r 0129989da2cd -r 070f78ba4244 tools/libxl/xl_cmdtable.c > --- a/tools/libxl/xl_cmdtable.c Mon Jan 30 17:19:46 2012 -0800 > +++ b/tools/libxl/xl_cmdtable.c Mon Jan 30 17:19:48 2012 -0800 > @@ -412,6 +412,20 @@ struct cmd_spec cmd_table[] = { > "Loads a new policy int the Flask Xen security module", > "<policy file>", > }, > + { "remus",Please also add this command to docs/man/xl.pod.1. Otherwise this patch looks good, thanks. Ian.
Shriram Rajagopalan
2012-Jan-31 17:59 UTC
Re: [PATCH 1 of 2] libxl: Remus - suspend/postflush/commit callbacks
On Tue, Jan 31, 2012 at 2:53 AM, Ian Campbell <Ian.Campbell@citrix.com>wrote:> > @@ -669,7 +715,27 @@ int libxl__domain_suspend_common(libxl__ > > } > > > > memset(&callbacks, 0, sizeof(callbacks)); > > - callbacks.suspend = libxl__domain_suspend_common_callback; > > + if (r_info != NULL) { > > + /* save_callbacks: > > + * suspend - called after expiration of checkpoint interval, > > + * to *suspend* the domain. > > + * > > + * postcopy - called after the domain''s dirty pages have been > > + * copied into an output buffer. We *resume* the > domain > > + * & the device model, return to the caller. Caller > then > > + * flushes the output buffer, while the domain > continues to run. > > + * > > + * checkpoint - called after the memory checkpoint has been > flushed out > > + * into the network. Send the saved device state, > *wait* > > + * for checkpoint ack and *release* the network > buffer (TBD). > > + * Then *sleep* for the checkpoint interval. > > + */ > > Please can you put this next to the definition of the struct (in > xenguest.h). Otherwise this patch looks ok to me. > > Ian. > > >struct save_callbacks already has comments on these callbacks, it is generic though, as the postcopy and checkpoint callbacks could be used even without Remus. One need not necessarily do all the actions specified in the above comment. But, with libxl/remus, it seemed proper to add more detailed comments to the code that does other things in the callback, specific to remus/checkpointing. shriram _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel