Shriram Rajagopalan
2013-Oct-21 05:58 UTC
[PATCH 0 of 5 V3] Remus/Libxl: Network buffering support
This patch series adds support for network buffering in the Remus codebase in libxl.

Changes in V3:
[1/5] Fix redundant checks in configure scripts (based on Ian Campbell's suggestions)
[2/5] Introduce locking in the script during IFB setup. Add xenstore paths used by netbuf scripts to xenstore-paths.markdown
[3/5] Hotplug script setup/teardown invocations are now asynchronous, following IanJ's feedback. However, the invocations are still sequential.
[5/5] Allow per-domain specification of netbuffer scripts in the xl remus command.
And minor nits throughout the series based on feedback from the last version.

Changes in V2:
[1/5] The configure script will automatically enable/disable network buffer support depending on the availability of the appropriate libnl3 version. (If libnl3 is unavailable, a warning message will be printed to let the user know that the feature has been disabled.) Use macros from pkg.m4 instead of pkg-config commands; removed redundant checks for libnl3 libraries.
[3,4/5] Minor nits.

Version 1:
[1/5] Changes to autoconf scripts to check for libnl3. Add linker flags to the libxl Makefile.
[2/5] External script to set up/tear down network buffering using libnl3's CLI. This script will be invoked by libxl before starting Remus. The script's main job is to bring up an IFB device with a plug qdisc attached to it. It then re-routes egress traffic from the guest's vif to the IFB device.
[3/5] Libxl code to invoke the external setup script, followed by netlink-related setup to obtain a handle on the output buffers attached to each vif.
[4/5] Libxl interaction with the network buffer module in the kernel via the libnl3 API.
[5/5] xl command-line switch to explicitly enable network buffering when starting Remus.

A few things to note:
a) Based on previous email discussions, the setup/teardown task has been moved to a hotplug-style shell script which can be customized as desired, instead of being implemented as C code inside libxl.
b) Libnl3 is not available on NetBSD, nor is it available on CentOS (Linux). So I have made network buffering support an optional feature that can be disabled if desired.
c) Since NetBSD does not have libnl3, I have put the setup script under the tools/hotplug/Linux folder.

thanks
shriram
Shriram Rajagopalan
2013-Oct-21 05:58 UTC
[PATCH 1 of 5 V3] remus: add libnl3 dependency to autoconf scripts
# HG changeset patch # User Shriram Rajagopalan <rshriram@cs.ubc.ca> # Date 1381540581 25200 # Node ID 357655a1598cba8f8460758714885563dd2b4a9c # Parent 808aba98084edee7b6dc78c7333da71d2be70386 remus: add libnl3 dependency to autoconf scripts Libnl3 is required for controlling Remus network buffering. This patch adds dependency on libnl3 (>= 3.2.8) to autoconf scripts. Also provide ability to configure tools without libnl3 support, that is without network buffering support. Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> diff -r 808aba98084e -r 357655a1598c config/Tools.mk.in --- a/config/Tools.mk.in Fri Oct 11 18:16:11 2013 -0700 +++ b/config/Tools.mk.in Fri Oct 11 18:16:21 2013 -0700 @@ -37,6 +37,8 @@ PTHREAD_LIBS := @PTHREAD_LIBS@ PTYFUNCS_LIBS := @PTYFUNCS_LIBS@ +LIBNL3_LIBS := @LIBNL3_LIBS@ +LIBNL3_CFLAGS := @LIBNL3_CFLAGS@ # Download GIT repositories via HTTP or GIT''s own protocol? # GIT''s protocol is faster and more robust, when it works at all (firewalls # may block it). We make it the default, but if your GIT repository downloads @@ -55,6 +57,7 @@ CONFIG_QEMU_TRAD := @qemu_traditional CONFIG_QEMU_XEN := @qemu_xen@ CONFIG_XEND := @xend@ CONFIG_BLKTAP1 := @blktap1@ +CONFIG_REMUS_NETBUF := @remus_netbuf@ #System options ZLIB := @zlib@ diff -r 808aba98084e -r 357655a1598c tools/configure.ac --- a/tools/configure.ac Fri Oct 11 18:16:11 2013 -0700 +++ b/tools/configure.ac Fri Oct 11 18:16:21 2013 -0700 @@ -219,6 +219,24 @@ AC_SUBST(libiconv) # Checks for header files. AC_CHECK_HEADERS([yajl/yajl_version.h sys/eventfd.h]) +# Check for libnl3 >=3.2.8. If present enable remus network buffering. +PKG_CHECK_MODULES(LIBNL3, [libnl-3.0 >= 3.2.8 libnl-route-3.0 >= 3.2.8], + [libnl3_lib="y"], [libnl3_lib="n"]) + +# Check for libnl3 command line utilities +PKG_CHECK_EXISTS([libnl-cli-3.0 >= 3.2.8], [libnl3_cli="y"], [libnl3_cli="n"]) + +AS_IF([test "x$libnl3_lib" = "xn" || test "x$libnl3_cli" = "xn"], [ + AC_MSG_WARN([Disabling support for Remus network buffering. + Please install libnl3 libraries, command line tools and devel + headers - version 3.2.8 or higher]) + AC_SUBST(remus_netbuf, [n]) + ],[ + AC_SUBST(LIBNL3_LIBS) + AC_SUBST(LIBNL3_CFLAGS) + AC_SUBST(remus_netbuf, [y]) +]) + AC_OUTPUT() AS_IF([test "x$xend" = "xy" ], [ diff -r 808aba98084e -r 357655a1598c tools/libxl/Makefile --- a/tools/libxl/Makefile Fri Oct 11 18:16:11 2013 -0700 +++ b/tools/libxl/Makefile Fri Oct 11 18:16:21 2013 -0700 @@ -21,11 +21,13 @@ endif LIBXL_LIBS LIBXL_LIBS = $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(LDLIBS_libxenstore) $(LDLIBS_libblktapctl) $(PTYFUNCS_LIBS) $(LIBUUID_LIBS) +LIBXL_LIBS += $(LIBNL3_LIBS) CFLAGS_LIBXL += $(CFLAGS_libxenctrl) CFLAGS_LIBXL += $(CFLAGS_libxenguest) CFLAGS_LIBXL += $(CFLAGS_libxenstore) CFLAGS_LIBXL += $(CFLAGS_libblktapctl) +CFLAGS_LIBXL += $(LIBNL3_CFLAGS) CFLAGS_LIBXL += -Wshadow CFLAGS += $(PTHREAD_CFLAGS) diff -r 808aba98084e -r 357655a1598c tools/remus/README --- a/tools/remus/README Fri Oct 11 18:16:11 2013 -0700 +++ b/tools/remus/README Fri Oct 11 18:16:21 2013 -0700 @@ -2,3 +2,9 @@ Remus provides fault tolerance for virtu checkpoints to a backup, which will activate if the target VM fails. See the website at http://nss.cs.ubc.ca/remus/ for details. + +Using Remus with libxl on Xen 4.4 and higher: + To enable network buffering, you need libnl 3.2.8 + or higher along with the development headers and command line utilities. 
+ If your distro does not have the appropriate libnl3 version, you can find + the latest source tarball of libnl3 at http://www.carisma.slowglass.com/~tgr/libnl/
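As an aside, the requirement encoded by the PKG_CHECK_MODULES call above can be checked by hand with pkg-config before building (this is only an illustrative check, not part of the patch; the module names are the ones the configure fragment uses):

    pkg-config --exists 'libnl-3.0 >= 3.2.8 libnl-route-3.0 >= 3.2.8' \
        && echo "libnl3 ok" || echo "libnl3 missing or too old"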
Shriram Rajagopalan
2013-Oct-21 05:58 UTC
[PATCH 2 of 5 V3] tools/hotplug: Remus network buffering setup scripts
# HG changeset patch # User Shriram Rajagopalan <rshriram@cs.ubc.ca> # Date 1382274772 25200 # Node ID 0001a8222a785865b753acf8bcdf97c10c9e3819 # Parent 357655a1598cba8f8460758714885563dd2b4a9c tools/hotplug: Remus network buffering setup scripts This patch introduces remus-netbuf-setup hotplug script responsible for setting up and tearing down the necessary infrastructure required for network output buffering in Remus. This script is intended to be invoked by libxl for each guest interface, when starting or stopping Remus. Apart from returning success/failure indication via the usual hotplug entries in xenstore, this script also writes to xenstore, the name of the IFB device to be used to control the vif''s network output. The script relies on libnl3 command line utilities to perform various setup/teardown functions. The script is confined to Linux platforms only since NetBSD does not seem to have libnl3. Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> diff -r 357655a1598c -r 0001a8222a78 docs/misc/xenstore-paths.markdown --- a/docs/misc/xenstore-paths.markdown Fri Oct 11 18:16:21 2013 -0700 +++ b/docs/misc/xenstore-paths.markdown Sun Oct 20 06:12:52 2013 -0700 @@ -356,6 +356,10 @@ The guest''s virtual time offset from UTC The device model version for a domain. +#### /libxl/$DOMID/remus/netbuf/$DEVID/ifb = STRING [n,INTERNAL] + +IFB device used by Remus to buffer network output from the associated vif. + [BLKIF]: http://xenbits.xen.org/docs/unstable/hypercall/include,public,io,blkif.h.html [FBIF]: http://xenbits.xen.org/docs/unstable/hypercall/include,public,io,fbif.h.html [HVMPARAMS]: http://xenbits.xen.org/docs/unstable/hypercall/include,public,hvm,params.h.html diff -r 357655a1598c -r 0001a8222a78 tools/hotplug/Linux/Makefile --- a/tools/hotplug/Linux/Makefile Fri Oct 11 18:16:21 2013 -0700 +++ b/tools/hotplug/Linux/Makefile Sun Oct 20 06:12:52 2013 -0700 @@ -16,6 +16,7 @@ XEN_SCRIPTS += network-nat vif-nat XEN_SCRIPTS += vif-openvswitch XEN_SCRIPTS += vif2 XEN_SCRIPTS += vif-setup +XEN_SCRIPTS-$(CONFIG_REMUS_NETBUF) += remus-netbuf-setup XEN_SCRIPTS += block XEN_SCRIPTS += block-enbd block-nbd XEN_SCRIPTS-$(CONFIG_BLKTAP1) += blktap diff -r 357655a1598c -r 0001a8222a78 tools/hotplug/Linux/remus-netbuf-setup --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/hotplug/Linux/remus-netbuf-setup Sun Oct 20 06:12:52 2013 -0700 @@ -0,0 +1,187 @@ +#!/bin/bash +#===========================================================================+# ${XEN_SCRIPT_DIR}/remus-netbuf-setup +# +# Script for attaching a network buffer to the specified vif (in any mode). +# The hotplugging system will call this script when starting remus via libxl +# API, libxl_domain_remus_start. +# +# Usage: +# remus-netbuf-setup (setup|teardown) +# +# Environment vars: +# vifname vif interface name (required). +# XENBUS_PATH path in Xenstore, where the IFB device details will be stored +# or read from (required). +# (libxl passes /libxl/<domid>/remus/netbuf/<devid>) +# IFB ifb interface to be cleaned up (required). [for teardown op only] + +# Written to the store: (setup operation) +# XENBUS_PATH/ifb=<ifbdevName> the IFB device serving +# as the intermediate buffer through which the interface''s network output +# can be controlled. 
+# +# To install a network buffer on a guest vif (vif1.0) using ifb (ifb0) +# we need to do the following +# +# ip link set dev ifb0 up +# tc qdisc add dev vif1.0 ingress +# tc filter add dev vif1.0 parent ffff: proto ip \ +# prio 10 u32 match u32 0 0 action mirred egress redirect dev ifb0 +# nl-qdisc-add --dev=ifb0 --parent root plug +# nl-qdisc-add --dev=ifb0 --parent root --update plug --limit=10000000 +# (10MB limit on buffer) +# +# So order of operations when installing a network buffer on vif1.0 +# 1. find a free ifb and bring up the device +# 2. redirect traffic from vif1.0 to ifb: +# 2.1 add ingress qdisc to vif1.0 (to capture outgoing packets from guest) +# 2.2 use tc filter command with actions mirred egress + redirect +# 3. install plug_qdisc on ifb device, with which we can buffer/release +# guest''s network output from vif1.0 +# +# + +#===========================================================================+ +# Unlike other vif scripts, vif-common is not needed here as it executes vif +#specific setup code such as renaming. +dir=$(dirname "$0") +. "$dir/xen-hotplug-common.sh" + +findCommand "$@" + +if [ "$command" != "setup" -a "$command" != "teardown" ] +then + echo "Invalid command: $command" + log err "Invalid command: $command" + exit 1 +fi + +evalVariables "$@" + +: ${vifname?} +: ${XENBUS_PATH?} + +check_libnl_tools() { + if ! command -v nl-qdisc-list > /dev/null 2>&1; then + fatal "Unable to find nl-qdisc-list tool" + fi + if ! command -v nl-qdisc-add > /dev/null 2>&1; then + fatal "Unable to find nl-qdisc-add tool" + fi + if ! command -v nl-qdisc-delete > /dev/null 2>&1; then + fatal "Unable to find nl-qdisc-delete tool" + fi +} + +# We only check for modules. We don''t load them. +# User/Admin is supposed to load ifb during boot time, +# ensuring that there are enough free ifbs in the system. +# Other modules will be loaded automatically by tc commands. +check_modules() { + for m in ifb sch_plug sch_ingress act_mirred cls_u32 + do + if ! modinfo $m > /dev/null 2>&1; then + fatal "Unable to find $m kernel module" + fi + done +} + +setup_ifb() { + + for ifb in `ifconfig -a -s|egrep ^ifb|cut -d '' '' -f1` + do + local installed=`nl-qdisc-list -d $ifb` + [ -n "$installed" ] && continue + IFB="$ifb" + break + done + + if [ -z "$IFB" ] + then + fatal "Unable to find a free IFB device for $vifname" + fi + + do_or_die ip link set dev "$IFB" up +} + +redirect_vif_traffic() { + local vif=$1 + local ifb=$2 + + do_or_die tc qdisc add dev "$vif" ingress + + tc filter add dev "$vif" parent ffff: proto ip prio 10 \ + u32 match u32 0 0 action mirred egress redirect dev "$ifb" >/dev/null 2>&1 + + if [ $? -ne 0 ] + then + do_without_error tc qdisc del dev "$vif" ingress + fatal "Failed to redirect traffic from $vif to $ifb" + fi +} + +add_plug_qdisc() { + local vif=$1 + local ifb=$2 + + nl-qdisc-add --dev="$ifb" --parent root plug >/dev/null 2>&1 + if [ $? -ne 0 ] + then + do_without_error tc qdisc del dev "$vif" ingress + fatal "Failed to add plug qdisc to $ifb" + fi + + #set ifb buffering limit in bytes. 
Its okay if this command + nl-qdisc-add --dev="$ifb" --parent root \ + --update plug --limit=10000000 >/dev/null 2>&1 +} + +teardown_netbuf() { + local vif=$1 + local ifb=$2 + + if [ "$ifb" ]; then + do_without_error ip link set dev "$ifb" down + do_without_error nl-qdisc-delete --dev="$ifb" --parent root plug >/dev/null 2>&1 + xenstore-rm -t "$XENBUS_PATH/ifb" 2>/dev/null || true + fi + do_without_error tc qdisc del dev "$vif" ingress + xenstore-rm -t "$XENBUS_PATH/hotplug-status" 2>/dev/null || true +} + +xs_write_failed() { + local vif=$1 + local ifb=$2 + teardown_netbuf "$vifname" "$IFB" + fatal "failed to write ifb name to xenstore" +} + +case "$command" in + setup) + check_libnl_tools + check_modules + + claim_lock "pickifb" + setup_ifb + redirect_vif_traffic "$vifname" "$IFB" + add_plug_qdisc "$vifname" "$IFB" + release_lock "pickifb" + + #not using xenstore_write that automatically exits on error + # because we need to cleanup + _xenstore_write "$XENBUS_PATH/ifb" "$IFB" || xs_write_failed "$vifname" "$IFB" + ;; + teardown) + : ${IFB?} + teardown_netbuf "$vifname" "$IFB" + ;; +esac + +log debug "Successful remus-netbuf-setup $command for $vifname, ifb $IFB." + +if [ "$command" = "setup" ] +then + success +fi
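For anyone who wants to exercise the script by hand outside libxl, its documented interface boils down to something like the following. This is only a sketch of a manual test: the vif name, the domid/devid components of the xenstore path and the ifb name are placeholders, and xenstored must be running with the corresponding /libxl path writable:

    # setup: picks a free ifb, redirects vif1.0's egress to it, installs the
    # plug qdisc, and writes the chosen ifb name to $XENBUS_PATH/ifb
    vifname=vif1.0 XENBUS_PATH=/libxl/1/remus/netbuf/0 \
        /etc/xen/scripts/remus-netbuf-setup setup

    # teardown: the IFB chosen during setup must be passed back in
    vifname=vif1.0 XENBUS_PATH=/libxl/1/remus/netbuf/0 IFB=ifb0 \
        /etc/xen/scripts/remus-netbuf-setup teardown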
Shriram Rajagopalan
2013-Oct-21 05:58 UTC
[PATCH 3 of 5 V3] tools/libxl: setup/teardown Remus network buffering
# HG changeset patch # User Shriram Rajagopalan <rshriram@cs.ubc.ca> # Date 1382295266 25200 # Node ID d3f088236c550213fc04ed982df47b4771b28d2f # Parent 0001a8222a785865b753acf8bcdf97c10c9e3819 tools/libxl: setup/teardown Remus network buffering Setup/teardown remus network buffering for a given guest, when libxl_domain_remus_start API is invoked. This patch does the following during setup: a) call the hotplug script for each vif to setup its network buffer b) establish a dedicated remus context containing libnl related state (netlink sockets, qdisc caches, etc.,) c) Obtain handles to plug qdiscs installed on the IFB devices chosen by the hotplug scripts. During teardown, the netlink resources are released, followed by invocation of hotplug scripts to remove the IFB devices. Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> diff -r 0001a8222a78 -r d3f088236c55 tools/libxl/Makefile --- a/tools/libxl/Makefile Sun Oct 20 06:12:52 2013 -0700 +++ b/tools/libxl/Makefile Sun Oct 20 11:54:26 2013 -0700 @@ -42,6 +42,13 @@ LIBXL_OBJS-y += libxl_blktap2.o else LIBXL_OBJS-y += libxl_noblktap2.o endif + +ifeq ($(CONFIG_REMUS_NETBUF),y) +LIBXL_OBJS-y += libxl_netbuffer.o +else +LIBXL_OBJS-y += libxl_nonetbuffer.o +endif + LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o LIBXL_OBJS-$(CONFIG_IA64) += libxl_nocpuid.o libxl_noarch.o LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_noarch.o diff -r 0001a8222a78 -r d3f088236c55 tools/libxl/libxl.c --- a/tools/libxl/libxl.c Sun Oct 20 06:12:52 2013 -0700 +++ b/tools/libxl/libxl.c Sun Oct 20 11:54:26 2013 -0700 @@ -698,8 +698,9 @@ libxl_vminfo * libxl_list_vm(libxl_ctx * return ptr; } -static void remus_failover_cb(libxl__egc *egc, - libxl__domain_suspend_state *dss, int rc); +static void remus_replication_failure_cb(libxl__egc *egc, + libxl__domain_suspend_state *dss, + int rc); /* TODO: Explicit Checkpoint acknowledgements via recv_fd. */ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info, @@ -718,40 +719,57 @@ int libxl_domain_remus_start(libxl_ctx * GCNEW(dss); dss->ao = ao; - dss->callback = remus_failover_cb; + dss->callback = remus_replication_failure_cb; dss->domid = domid; dss->fd = send_fd; /* TODO do something with recv_fd */ dss->type = type; dss->live = 1; dss->debug = 0; - dss->remus = info; assert(info); - /* TBD: Remus setup - i.e. 
attach qdisc, enable disk buffering, etc */ + GCNEW(dss->remus_ctx); + + /* convenience shorthand */ + libxl__remus_ctx *remus_ctx = dss->remus_ctx; + remus_ctx->blackhole = info->blackhole; + remus_ctx->interval = info->interval; + remus_ctx->compression = info->compression; + remus_ctx->dss = dss; + + /* TODO: enable disk buffering */ + + /* Setup network buffering */ + if (info->netbuf) { + if (!libxl__netbuffer_enabled(gc)) { + LOG(ERROR, "Remus: No support for network buffering"); + goto out; + } + + if (info->netbufscript) { + remus_ctx->netbufscript + libxl__strdup(gc, info->netbufscript); + } else { + remus_ctx->netbufscript + GCSPRINTF("%s/remus-netbuf-setup", + libxl__xen_script_dir_path()); + } + } /* Point of no return */ - libxl__domain_suspend(egc, dss); + libxl__remus_setup_initiate(egc, dss); return AO_INPROGRESS; out: return AO_ABORT(rc); } -static void remus_failover_cb(libxl__egc *egc, - libxl__domain_suspend_state *dss, int rc) +static void remus_replication_failure_cb(libxl__egc *egc, + libxl__domain_suspend_state *dss, + int rc) { STATE_AO_GC(dss->ao); - /* - * With Remus, if we reach this point, it means either - * backup died or some network error occurred preventing us - * from sending checkpoints. - */ - - /* TBD: Remus cleanup - i.e. detach qdisc, release other - * resources. - */ libxl__ao_complete(egc, ao, rc); } diff -r 0001a8222a78 -r d3f088236c55 tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c Sun Oct 20 06:12:52 2013 -0700 +++ b/tools/libxl/libxl_dom.c Sun Oct 20 11:54:26 2013 -0700 @@ -1211,6 +1211,50 @@ int libxl__toolstack_save(uint32_t domid return 0; } +/*----- remus setup/teardown code -----*/ +void libxl__remus_setup_initiate(libxl__egc *egc, + libxl__domain_suspend_state *dss) +{ + if (!dss->remus_ctx->netbufscript) + libxl__remus_setup_done(egc, dss, 0); + else + libxl__remus_netbuf_setup(egc, dss); +} + +void libxl__remus_setup_done(libxl__egc *egc, + libxl__domain_suspend_state *dss, + int rc) +{ + STATE_AO_GC(dss->ao); + if (!rc) { + libxl__domain_suspend(egc, dss); + return; + } + + LOG(ERROR, "Remus: failed to setup network buffering" + " for guest with domid %u", dss->domid); + domain_suspend_done(egc, dss, rc); +} + +void libxl__remus_teardown_initiate(libxl__egc *egc, + libxl__domain_suspend_state *dss, + int rc) +{ + /* stash rc somewhere before invoking teardown ops. 
*/ + dss->remus_ctx->saved_rc = rc; + + if (!dss->remus_ctx->netbuf_ctx) + libxl__remus_teardown_done(egc, dss); + else + libxl__remus_netbuf_teardown(egc, dss); +} + +void libxl__remus_teardown_done(libxl__egc *egc, + libxl__domain_suspend_state *dss) +{ + dss->callback(egc, dss, dss->remus_ctx->saved_rc); +} + /*----- remus callbacks -----*/ static int libxl__remus_domain_suspend_callback(void *data) @@ -1259,7 +1303,7 @@ static void remus_checkpoint_dm_saved(li /* REMUS TODO: Wait for disk and memory ack, release network buffer */ /* REMUS TODO: make this asynchronous */ assert(!rc); /* REMUS TODO handle this error properly */ - usleep(dss->interval * 1000); + usleep(dss->remus_ctx->interval * 1000); libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1); } @@ -1277,7 +1321,6 @@ void libxl__domain_suspend(libxl__egc *e const libxl_domain_type type = dss->type; const int live = dss->live; const int debug = dss->debug; - const libxl_domain_remus_info *const r_info = dss->remus; libxl__srm_save_autogen_callbacks *const callbacks &dss->shs.callbacks.save.a; @@ -1312,11 +1355,8 @@ void libxl__domain_suspend(libxl__egc *e dss->guest_responded = 0; dss->dm_savefile = libxl__device_model_savefile(gc, domid); - if (r_info != NULL) { - dss->interval = r_info->interval; - if (r_info->compression) - dss->xcflags |= XCFLAGS_CHECKPOINT_COMPRESS; - } + if (dss->remus_ctx && dss->remus_ctx->compression) + dss->xcflags |= XCFLAGS_CHECKPOINT_COMPRESS; dss->xce = xc_evtchn_open(NULL, 0); if (dss->xce == NULL) @@ -1335,7 +1375,7 @@ void libxl__domain_suspend(libxl__egc *e } memset(callbacks, 0, sizeof(*callbacks)); - if (r_info != NULL) { + if (dss->remus_ctx != NULL) { callbacks->suspend = libxl__remus_domain_suspend_callback; callbacks->postcopy = libxl__remus_domain_resume_callback; callbacks->checkpoint = libxl__remus_domain_checkpoint_callback; @@ -1495,6 +1535,17 @@ static void domain_suspend_done(libxl__e if (dss->xce != NULL) xc_evtchn_close(dss->xce); + if (dss->remus_ctx) { + /* + * With Remus, if we reach this point, it means either + * backup died or some network error occurred preventing us + * from sending checkpoints. Teardown the network buffers and + * release netlink resources. This is an async op. 
+ */ + libxl__remus_teardown_initiate(egc, dss, rc); + return; + } + dss->callback(egc, dss, rc); } diff -r 0001a8222a78 -r d3f088236c55 tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h Sun Oct 20 06:12:52 2013 -0700 +++ b/tools/libxl/libxl_internal.h Sun Oct 20 11:54:26 2013 -0700 @@ -2242,6 +2242,50 @@ typedef struct libxl__logdirty_switch { libxl__ev_time timeout; } libxl__logdirty_switch; +typedef struct libxl__remus_ctx { + /* checkpoint interval */ + int interval; + int blackhole; + int compression; + int saved_rc; + /* Script to setup/teardown network buffers */ + const char *netbufscript; + /* Opaque context containing network buffer related stuff */ + void *netbuf_ctx; + libxl__domain_suspend_state *dss; + libxl__ev_time timeout; + libxl__ev_child child; + int num_exec; +} libxl__remus_ctx; + +_hidden int libxl__netbuffer_enabled(libxl__gc *gc); + +_hidden void libxl__remus_setup_initiate(libxl__egc *egc, + libxl__domain_suspend_state *dss); + +_hidden void libxl__remus_setup_done(libxl__egc *egc, + libxl__domain_suspend_state *dss, + int rc); + +_hidden void libxl__remus_teardown_initiate(libxl__egc *egc, + libxl__domain_suspend_state *dss, + int rc); + +_hidden void libxl__remus_teardown_done(libxl__egc *egc, + libxl__domain_suspend_state *dss); + +_hidden void libxl__remus_netbuf_setup(libxl__egc *egc, + libxl__domain_suspend_state *dss); + +_hidden void libxl__remus_netbuf_teardown(libxl__egc *egc, + libxl__domain_suspend_state *dss); + +_hidden int libxl__remus_netbuf_start_new_epoch(libxl__gc *gc, uint32_t domid, + libxl__remus_ctx *remus_ctx); + +_hidden int libxl__remus_netbuf_release_prev_epoch(libxl__gc *gc, uint32_t domid, + libxl__remus_ctx *remus_ctx); + struct libxl__domain_suspend_state { /* set by caller of libxl__domain_suspend */ libxl__ao *ao; @@ -2252,7 +2296,7 @@ struct libxl__domain_suspend_state { libxl_domain_type type; int live; int debug; - const libxl_domain_remus_info *remus; + libxl__remus_ctx *remus_ctx; /* private */ xc_evtchn *xce; /* event channel handle */ int suspend_eventchn; @@ -2260,7 +2304,6 @@ struct libxl__domain_suspend_state { int xcflags; int guest_responded; const char *dm_savefile; - int interval; /* checkpoint interval (for Remus) */ libxl__save_helper_state shs; libxl__logdirty_switch logdirty; /* private for libxl__domain_save_device_model */ diff -r 0001a8222a78 -r d3f088236c55 tools/libxl/libxl_netbuffer.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/libxl/libxl_netbuffer.c Sun Oct 20 11:54:26 2013 -0700 @@ -0,0 +1,516 @@ +/* + * Copyright (C) 2013 + * Author Shriram Rajagopalan <rshriram@cs.ubc.ca> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as published + * by the Free Software Foundation; version 2.1 only. with the special + * exception on linking described in file LICENSE. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. 
+ */ + +#include "libxl_osdeps.h" /* must come before any other headers */ + +#include "libxl_internal.h" + +#include <netlink/cache.h> +#include <netlink/socket.h> +#include <netlink/attr.h> +#include <netlink/route/link.h> +#include <netlink/route/route.h> +#include <netlink/route/qdisc.h> +#include <netlink/route/qdisc/plug.h> + +typedef struct libxl__remus_netbuf_ctx { + struct rtnl_qdisc **netbuf_qdisc_list; + struct nl_sock *nlsock; + struct nl_cache *qdisc_cache; + char **vif_list; + char **ifb_list; + uint32_t num_netbufs; + uint32_t unused; +} libxl__remus_netbuf_ctx; + +/* If the device has a vifname, then use that instead of + * the vifX.Y format. + */ +static char *get_vifname(libxl__gc *gc, uint32_t domid, + libxl_device_nic *nic) +{ + const char *vifname = NULL; + vifname = libxl__xs_read(gc, XBT_NULL, + libxl__sprintf(gc, + "%s/backend/vif/%d/%d/vifname", + libxl__xs_get_dompath(gc, 0), + domid, nic->devid)); + if (!vifname) { + vifname = libxl__device_nic_devname(gc, domid, + nic->devid, + nic->nictype); + } + + return libxl__strdup(gc, vifname); +} + +static char **get_guest_vif_list(libxl__gc *gc, uint32_t domid, + int *num_vifs) +{ + libxl_device_nic *nics = NULL; + int nb, i = 0; + char **vif_list = NULL; + + *num_vifs = 0; + nics = libxl_device_nic_list(CTX, domid, &nb); + if (!nics) return NULL; + + /* Ensure that none of the vifs are backed by driver domains */ + for (i = 0; i < nb; i++) { + if (nics[i].backend_domid != LIBXL_TOOLSTACK_DOMID) { + LOG(ERROR, "vif %s has driver domain (%u) as its backend.\n" + "Network buffering is not supported with driver domains", + get_vifname(gc, domid, &nics[i]), nics[i].backend_domid); + *num_vifs = -1; + goto end; + } + } + + vif_list = libxl__calloc(gc, nb, sizeof(char *)); + for (i = 0; i < nb; ++i) + vif_list[i] = get_vifname(gc, domid, &nics[i]); + *num_vifs = nb; + + end: + for (i = 0; i < nb; i++) + libxl_device_nic_dispose(&nics[i]); + free(nics); + return vif_list; +} + +static int init_qdiscs(libxl__gc *gc, + libxl__remus_ctx *remus_ctx); + +static int exec_netbuf_script(libxl__gc *gc, libxl__remus_ctx *remus_ctx, + char *op, libxl__ev_child_callback *death); + +static void netbuf_setup_script_cb(libxl__egc *egc, + libxl__ev_child *child, + pid_t pid, int status); + +static void netbuf_teardown_script_cb(libxl__egc *egc, + libxl__ev_child *child, + pid_t pid, int status); + +static void netbuf_setup_timeout_cb(libxl__egc *egc, + libxl__ev_time *ev, + const struct timeval *requested_abs); + +static int init_qdiscs(libxl__gc *gc, + libxl__remus_ctx *remus_ctx) +{ + int i, ret, ifindex; + struct rtnl_link *ifb = NULL; + struct rtnl_qdisc *qdisc = NULL; + + /* convenience vars */ + libxl__remus_netbuf_ctx *netbuf_ctx = remus_ctx->netbuf_ctx; + int num_netbufs = netbuf_ctx->num_netbufs; + char **ifb_list = netbuf_ctx->ifb_list; + + /* Now that we have brought up IFB devices with plug qdisc for + * each vif, lets get a netlink handle on the plug qdisc for use + * during checkpointing. + */ + netbuf_ctx->nlsock = nl_socket_alloc(); + if (!netbuf_ctx->nlsock) { + LOG(ERROR, "cannot allocate nl socket"); + goto end; + } + + ret = nl_connect(netbuf_ctx->nlsock, NETLINK_ROUTE); + if (ret) { + LOG(ERROR, "failed to open netlink socket: %s", + nl_geterror(ret)); + goto end; + } + + /* get list of all qdiscs installed on network devs. 
*/ + ret = rtnl_qdisc_alloc_cache(netbuf_ctx->nlsock, + &netbuf_ctx->qdisc_cache); + if (ret) { + LOG(ERROR, "failed to allocate qdisc cache: %s", + nl_geterror(ret)); + goto end; + } + + /* list of handles to plug qdiscs */ + netbuf_ctx->netbuf_qdisc_list + libxl__calloc(gc, num_netbufs, + sizeof(struct rtnl_qdisc *)); + + ifb = rtnl_link_alloc(); + + for (i = 0; i < num_netbufs; ++i) { + + /* get a handle to the IFB interface */ + ret = rtnl_link_get_kernel(netbuf_ctx->nlsock, 0, + ifb_list[i], &ifb); + if (ret) { + LOG(ERROR, "cannot obtain handle for %s: %s", ifb_list[i], + nl_geterror(ret)); + goto end; + } + + ifindex = rtnl_link_get_ifindex(ifb); + if (!ifindex) { + LOG(ERROR, "interface %s has no index", ifb_list[i]); + goto end; + } + + /* Get a reference to the root qdisc installed on the IFB, by + * querying the qdisc list we obtained earlier. The netbufscript + * sets up the plug qdisc as the root qdisc, so we don''t have to + * search the entire qdisc tree on the IFB dev. + + * There is no need to explicitly free this qdisc as its just a + * reference from the qdisc cache we allocated earlier. + */ + qdisc = rtnl_qdisc_get_by_parent(netbuf_ctx->qdisc_cache, ifindex, + TC_H_ROOT); + + if (qdisc) { + const char *tc_kind = rtnl_tc_get_kind(TC_CAST(qdisc)); + /* Sanity check: Ensure that the root qdisc is a plug qdisc. */ + if (!tc_kind || strcmp(tc_kind, "plug")) { + LOG(ERROR, "plug qdisc is not installed on %s", ifb_list[i]); + goto end; + } + netbuf_ctx->netbuf_qdisc_list[i] = qdisc; + } else { + LOG(ERROR, "Cannot get qdisc handle from ifb %s", ifb_list[i]); + goto end; + } + } + + rtnl_link_put(ifb); + return 0; + + end: + if (ifb) rtnl_link_put(ifb); + return ERROR_FAIL; +} + +/* the script needs the following env & args + * $vifname + * $XENBUS_PATH (/libxl/<domid>/remus/netbuf/<devid>/) + * $IFB (for teardown) + * setup/teardown as command line arg. 
+ * In return, the script writes the name of IFB device (during setup) to be + * used for output buffering into XENBUS_PATH/ifb + */ +static int exec_netbuf_script(libxl__gc *gc, libxl__remus_ctx *remus_ctx, + char *op, libxl__ev_child_callback *death) +{ + int arraysize, nr = 0; + char **env = NULL, **args = NULL; + pid_t pid; + libxl__ev_child *child = &remus_ctx->child; + libxl__ev_time *timeout = &remus_ctx->timeout; + char *script = libxl__strdup(gc, remus_ctx->netbufscript); + uint32_t domid = remus_ctx->dss->domid; + int devid = remus_ctx->num_exec; + libxl__remus_netbuf_ctx *netbuf_ctx = remus_ctx->netbuf_ctx; + char *vif = netbuf_ctx->vif_list[devid]; + char *ifb = netbuf_ctx->ifb_list[devid]; + + if (!strcmp(op, "setup")) + arraysize = 5; + else + arraysize = 7; + + GCNEW_ARRAY(env, arraysize); + env[nr++] = "vifname"; + env[nr++] = vif; + env[nr++] = "XENBUS_PATH"; + env[nr++] = GCSPRINTF("%s/remus/netbuf/%d", + libxl__xs_libxl_path(gc, domid), devid); + if (!strcmp(op, "teardown")) { + env[nr++] = "IFB"; + env[nr++] = ifb; + } + env[nr++] = NULL; + assert(nr <= arraysize); + + arraysize = 3; nr = 0; + GCNEW_ARRAY(args, arraysize); + args[nr++] = script; + args[nr++] = op; + args[nr++] = NULL; + assert(nr == arraysize); + + libxl__ev_child_init(child); + + /* Set hotplug timeout */ + if (libxl__ev_time_register_rel(gc, timeout, + netbuf_setup_timeout_cb, + LIBXL_HOTPLUG_TIMEOUT * 1000)) { + LOG(ERROR, "unable to register timeout for " + "netbuf setup script %s on vif %s", script, vif); + return ERROR_FAIL; + } + + LOG(DEBUG, "Calling netbuf script: %s %s on vif %s", + script, op, vif); + + /* Fork and exec netbuf script */ + pid = libxl__ev_child_fork(gc, child, death); + if (pid == -1) { + LOG(ERROR, "unable to fork netbuf script %s", script); + return ERROR_FAIL; + } + + if (!pid) { + /* child: Launch netbuf script */ + libxl__exec(gc, -1, -1, -1, args[0], args, env); + /* notreached */ + abort(); + } + + return 0; +} + +static void netbuf_teardown_script_cb(libxl__egc *egc, + libxl__ev_child *child, + pid_t pid, int status) +{ + libxl__remus_ctx *remus_ctx = CONTAINER_OF(child, *remus_ctx, child); + libxl__remus_netbuf_ctx *netbuf_ctx = remus_ctx->netbuf_ctx; + + STATE_AO_GC(remus_ctx->dss->ao); + + if (status) { + libxl_report_child_exitstatus(CTX, LIBXL__LOG_ERROR, + remus_ctx->netbufscript, + pid, status); + } + + remus_ctx->num_exec++; + if (remus_ctx->num_exec < netbuf_ctx->num_netbufs) { + if (exec_netbuf_script(gc, remus_ctx, + "teardown", netbuf_teardown_script_cb)) + goto end; + return; + } + + end: + libxl__remus_teardown_done(egc, remus_ctx->dss); +} + +static void netbuf_setup_script_cb(libxl__egc *egc, + libxl__ev_child *child, + pid_t pid, int status) +{ + libxl__remus_ctx *remus_ctx = CONTAINER_OF(child, *remus_ctx, child); + char *out_path_base, *hotplug_error; + uint32_t domid = remus_ctx->dss->domid; + int devid = remus_ctx->num_exec; + libxl__remus_netbuf_ctx *netbuf_ctx = remus_ctx->netbuf_ctx; + char *vif = netbuf_ctx->vif_list[devid]; + char **ifb = &netbuf_ctx->ifb_list[devid]; + int rc = ERROR_FAIL; + + STATE_AO_GC(remus_ctx->dss->ao); + + libxl__ev_time_deregister(gc, &remus_ctx->timeout); + + out_path_base = GCSPRINTF("%s/remus/netbuf/%d", + libxl__xs_libxl_path(gc, domid), devid); + + hotplug_error = libxl__xs_read(gc, XBT_NULL, + GCSPRINTF("%s/hotplug-error", + out_path_base)); + if (hotplug_error) { + LOG(ERROR, "netbuf script %s setup failed for vif %s: %s", + remus_ctx->netbufscript, + netbuf_ctx->vif_list[devid], hotplug_error); + goto end; + } 
+ + if (status) { + libxl_report_child_exitstatus(CTX, LIBXL__LOG_ERROR, + remus_ctx->netbufscript, + pid, status); + goto end; + } + + *ifb = libxl__xs_read(gc, XBT_NULL, + GCSPRINTF("%s/remus/netbuf/%d/ifb", + libxl__xs_libxl_path(gc, domid), + devid)); + if (!(*ifb)) { + LOG(ERROR, "Cannot get ifb dev name for domain %u dev %s", + domid, vif); + goto end; + } + + LOG(DEBUG, "%s will buffer packets from vif %s", *ifb, vif); + remus_ctx->num_exec++; + if (remus_ctx->num_exec < netbuf_ctx->num_netbufs) { + if (exec_netbuf_script(gc, remus_ctx, + "setup", netbuf_setup_script_cb)) + goto end; + return; + } + + rc = init_qdiscs(gc, remus_ctx); + end: + libxl__remus_setup_done(egc, remus_ctx->dss, rc); +} + +static void netbuf_setup_timeout_cb(libxl__egc *egc, + libxl__ev_time *ev, + const struct timeval *requested_abs) +{ + libxl__remus_ctx *remus_ctx = CONTAINER_OF(ev, *remus_ctx, timeout); + int devid = remus_ctx->num_exec; + libxl__remus_netbuf_ctx *netbuf_ctx = remus_ctx->netbuf_ctx; + char *vif = netbuf_ctx->vif_list[devid]; + + STATE_AO_GC(remus_ctx->dss->ao); + + libxl__ev_time_deregister(gc, &remus_ctx->timeout); + assert(libxl__ev_child_inuse(&remus_ctx->child)); + + LOG(DEBUG, "killing hotplug script %s (on vif %s) because of timeout", + remus_ctx->netbufscript, vif); + + if (kill(remus_ctx->child.pid, SIGKILL)) { + LOGEV(ERROR, errno, "unable to kill hotplug script %s [%ld]", + remus_ctx->netbufscript, + (unsigned long)remus_ctx->child.pid); + } + + return; +} + +int libxl__netbuffer_enabled(libxl__gc *gc) +{ + return 1; +} + +/* Scan through the list of vifs belonging to domid and + * invoke the netbufscript to setup the IFB device & plug qdisc + * for each vif. Then scan through the list of IFB devices to obtain + * a handle on the plug qdisc installed on these IFB devices. + * Network output buffering is controlled via these qdiscs. + */ +void libxl__remus_netbuf_setup(libxl__egc *egc, + libxl__domain_suspend_state *dss) +{ + libxl__remus_netbuf_ctx *netbuf_ctx = NULL; + uint32_t domid = dss->domid; + libxl__remus_ctx *remus_ctx = dss->remus_ctx; + int num_netbufs = 0; + int rc = ERROR_FAIL; + + STATE_AO_GC(dss->ao); + + netbuf_ctx = libxl__zalloc(gc, sizeof(libxl__remus_netbuf_ctx)); + netbuf_ctx->vif_list = get_guest_vif_list(gc, domid, &num_netbufs); + if (!num_netbufs) { + rc = 0; + goto end; + } + + if (num_netbufs < 0) goto end; + + netbuf_ctx->ifb_list = libxl__calloc(gc, num_netbufs, + sizeof(char *)); + netbuf_ctx->num_netbufs = num_netbufs; + remus_ctx->netbuf_ctx = netbuf_ctx; + remus_ctx->num_exec = 0; //start at devid == 0 + if (exec_netbuf_script(gc, remus_ctx, "setup", + netbuf_setup_script_cb)) + goto end; + return; + + end: + libxl__remus_setup_done(egc, dss, rc); +} + +/* Note: This function will be called in the same gc context as + * libxl__remus_netbuf_setup, created during the libxl_domain_remus_start + * API call. 
+ */ +void libxl__remus_netbuf_teardown(libxl__egc *egc, + libxl__domain_suspend_state *dss) +{ + libxl__remus_ctx *remus_ctx = dss->remus_ctx; + libxl__remus_netbuf_ctx *netbuf_ctx = remus_ctx->netbuf_ctx; + STATE_AO_GC(dss->ao); + + if (netbuf_ctx->qdisc_cache) + nl_cache_free(netbuf_ctx->qdisc_cache); + + if (netbuf_ctx->nlsock) + nl_close(netbuf_ctx->nlsock); + + remus_ctx->num_exec = 0; //start at devid == 0 + if (exec_netbuf_script(gc, remus_ctx, "teardown", + netbuf_teardown_script_cb)) + libxl__remus_teardown_done(egc, dss); +} + +#define TC_BUFFER_START 1 +#define TC_BUFFER_RELEASE 2 +static int remus_netbuf_op(libxl__gc *gc, uint32_t domid, + libxl__remus_ctx *remus_ctx, + int buffer_op) +{ + int i, ret; + libxl__remus_netbuf_ctx *netbuf_ctx = remus_ctx->netbuf_ctx; + + for (i = 0; i < netbuf_ctx->num_netbufs; ++i) { + if (buffer_op == TC_BUFFER_START) + ret = rtnl_qdisc_plug_buffer(netbuf_ctx->netbuf_qdisc_list[i]); + else + ret = rtnl_qdisc_plug_release_one(netbuf_ctx->netbuf_qdisc_list[i]); + + if (!ret) + ret = rtnl_qdisc_add(netbuf_ctx->nlsock, + netbuf_ctx->netbuf_qdisc_list[i], + NLM_F_REQUEST); + if (ret) { + LOG(ERROR, "Remus: cannot do netbuf op %s on %s:%s", + ((buffer_op == TC_BUFFER_START) ? + "start_new_epoch" : "release_prev_epoch"), + netbuf_ctx->ifb_list[i], nl_geterror(ret)); + return ERROR_FAIL; + } + } + + return 0; +} + +int libxl__remus_netbuf_start_new_epoch(libxl__gc *gc, uint32_t domid, + libxl__remus_ctx *remus_ctx) +{ + return remus_netbuf_op(gc, domid, remus_ctx, TC_BUFFER_START); +} + +int libxl__remus_netbuf_release_prev_epoch(libxl__gc *gc, uint32_t domid, + libxl__remus_ctx *remus_ctx) +{ + return remus_netbuf_op(gc, domid, remus_ctx, TC_BUFFER_RELEASE); +} + +/* + * Local variables: + * mode: C + * c-basic-offset: 4 + * indent-tabs-mode: nil + * End: + */ diff -r 0001a8222a78 -r d3f088236c55 tools/libxl/libxl_nonetbuffer.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/libxl/libxl_nonetbuffer.c Sun Oct 20 11:54:26 2013 -0700 @@ -0,0 +1,56 @@ +/* + * Copyright (C) 2013 + * Author Shriram Rajagopalan <rshriram@cs.ubc.ca> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as published + * by the Free Software Foundation; version 2.1 only. with the special + * exception on linking described in file LICENSE. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. 
+ */ + +#include "libxl_osdeps.h" /* must come before any other headers */ + +#include "libxl_internal.h" + +int libxl__netbuffer_enabled(libxl__gc *gc) +{ + return 0; +} + +/* Remus network buffer related stubs */ +void libxl__remus_netbuf_setup(libxl__egc *egc, + libxl__domain_suspend_state *dss) +{ +} + +void libxl__remus_netbuf_teardown(libxl__egc *egc, + libxl__domain_suspend_state *dss) +{ +} + +int libxl__remus_netbuf_start_new_epoch(libxl__gc *gc, uint32_t domid, + libxl__remus_ctx *remus_ctx) +{ + LOG(ERROR, "Remus: No support for network buffering"); + return ERROR_FAIL; +} + +int libxl__remus_netbuf_release_prev_epoch(libxl__gc *gc, uint32_t domid, + libxl__remus_ctx *remus_ctx) +{ + LOG(ERROR, "Remus: No support for network buffering"); + return ERROR_FAIL; +} + +/* + * Local variables: + * mode: C + * c-basic-offset: 4 + * indent-tabs-mode: nil + * End: + */ diff -r 0001a8222a78 -r d3f088236c55 tools/libxl/libxl_types.idl --- a/tools/libxl/libxl_types.idl Sun Oct 20 06:12:52 2013 -0700 +++ b/tools/libxl/libxl_types.idl Sun Oct 20 11:54:26 2013 -0700 @@ -539,6 +539,8 @@ libxl_domain_remus_info = Struct("domain ("interval", integer), ("blackhole", bool), ("compression", bool), + ("netbuf", bool), + ("netbufscript", string), ]) libxl_event_type = Enumeration("event_type", [
Shriram Rajagopalan
2013-Oct-21 05:58 UTC
[PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks
# HG changeset patch # User Shriram Rajagopalan <rshriram@cs.ubc.ca> # Date 1382295546 25200 # Node ID a8deb9499e9dcce9869025fa1c02cf2e0d58612a # Parent d3f088236c550213fc04ed982df47b4771b28d2f tools/libxl: Control network buffering in remus callbacks This patch constitutes the core network buffering logic. and does the following: a) create a new network buffer when the domain is suspended (remus_domain_suspend_callback) b) release the previous network buffer pertaining to the committed checkpoint (remus_domain_checkpoint_dm_saved) Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> diff -r d3f088236c55 -r a8deb9499e9d tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c Sun Oct 20 11:54:26 2013 -0700 +++ b/tools/libxl/libxl_dom.c Sun Oct 20 11:59:06 2013 -0700 @@ -1259,8 +1259,24 @@ void libxl__remus_teardown_done(libxl__e static int libxl__remus_domain_suspend_callback(void *data) { - /* REMUS TODO: Issue disk and network checkpoint reqs. */ - return libxl__domain_suspend_common_callback(data); + /* REMUS TODO: Issue disk checkpoint reqs. */ + libxl__save_helper_state *shs = data; + libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); + libxl__remus_ctx *remus_ctx = dss->remus_ctx; + bool is_suspended; + STATE_AO_GC(dss->ao); + + is_suspended = !!libxl__domain_suspend_common_callback(data); + + if (!remus_ctx->netbuf_ctx) return is_suspended; + + if (is_suspended) { + if (libxl__remus_netbuf_start_new_epoch(gc, dss->domid, + remus_ctx)) + return !is_suspended; + } + + return is_suspended; } static int libxl__remus_domain_resume_callback(void *data) @@ -1273,7 +1289,7 @@ static int libxl__remus_domain_resume_ca if (libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1)) return 0; - /* REMUS TODO: Deal with disk. Start a new network output buffer */ + /* REMUS TODO: Deal with disk. */ return 1; } @@ -1300,11 +1316,42 @@ static void libxl__remus_domain_checkpoi static void remus_checkpoint_dm_saved(libxl__egc *egc, libxl__domain_suspend_state *dss, int rc) { - /* REMUS TODO: Wait for disk and memory ack, release network buffer */ - /* REMUS TODO: make this asynchronous */ - assert(!rc); /* REMUS TODO handle this error properly */ - usleep(dss->remus_ctx->interval * 1000); - libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1); + /* + * REMUS TODO: Wait for disk and explicit memory ack (through restore + * callback from remote) before releasing network buffer. + */ + libxl__remus_ctx *remus_ctx = dss->remus_ctx; + struct timespec epoch; + int do_next_iter = 0; + STATE_AO_GC(dss->ao); + + if (rc) { + LOG(ERROR, "Failed to save device model. Terminating Remus.."); + goto out; + } + + if (remus_ctx->netbuf_ctx) { + rc = libxl__remus_netbuf_release_prev_epoch(gc, dss->domid, + remus_ctx); + if (rc) { + LOG(ERROR, "Failed to release network buffer." + " Terminating Remus.."); + goto out; + } + } + + epoch.tv_sec = remus_ctx->interval / 1000; /* interval is in ms */ + epoch.tv_nsec = remus_ctx->interval * 1000L * 1000L; + nanosleep(&epoch, 0); + /* + * Set return value to 1, so that the infinite checkpoint cycle + * continues. See xc_domain_save.c: xc_domain_save() + */ + do_next_iter = 1; + + out: + libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, + do_next_iter); } /*----- main code for suspending, in order of execution -----*/
Shriram Rajagopalan
2013-Oct-21 05:58 UTC
[PATCH 5 of 5 V3] tools/xl: Remus - Network buffering cmdline switch
# HG changeset patch # User Shriram Rajagopalan <rshriram@cs.ubc.ca> # Date 1382295556 25200 # Node ID f4eea1e0ac3ed92ed17bbd74b0705743d20b302d # Parent a8deb9499e9dcce9869025fa1c02cf2e0d58612a tools/xl: Remus - Network buffering cmdline switch Command line switch to ''xl remus'' command, to enable network buffering. Pass on this flag to libxl so that it can act accordingly. Also update man pages to reflect the addition of a new option to ''xl remus'' command. Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> diff -r a8deb9499e9d -r f4eea1e0ac3e docs/man/xl.conf.pod.5 --- a/docs/man/xl.conf.pod.5 Sun Oct 20 11:59:06 2013 -0700 +++ b/docs/man/xl.conf.pod.5 Sun Oct 20 11:59:16 2013 -0700 @@ -105,6 +105,12 @@ Configures the default gateway device to Default: C<None> +=item B<remus.default.netbufscript="PATH"> + +Configures the default script used by Remus to setup network buffering. + +Default: C</etc/xen/scripts/remus-netbuf-setup> + =item B<output_format="json|sxp"> Configures the default output format used by xl when printing "machine diff -r a8deb9499e9d -r f4eea1e0ac3e docs/man/xl.pod.1 --- a/docs/man/xl.pod.1 Sun Oct 20 11:59:06 2013 -0700 +++ b/docs/man/xl.pod.1 Sun Oct 20 11:59:16 2013 -0700 @@ -398,8 +398,7 @@ Print huge (!) amount of debug during th Enable Remus HA for domain. By default B<xl> relies on ssh as a transport mechanism between the two hosts. -N.B: Remus support in xl is still in experimental (proof-of-concept) phase. - There is no support for network or disk buffering at the moment. +N.B: There is no support for disk buffering at the moment. B<OPTIONS> @@ -418,6 +417,13 @@ Generally useful for debugging. Disable memory checkpoint compression. +=item B<-n> + +Enable network output buffering. The default script used to configure +network buffering is /etc/xen/scripts/remus-netbuf-setup. If you wish to +use a custom script, set the global variable "remus.default.netbufscript" +in /etc/xen/xl.conf to point to your script. + =item B<-s> I<sshcommand> Use <sshcommand> instead of ssh. String will be passed to sh. 
diff -r a8deb9499e9d -r f4eea1e0ac3e tools/libxl/xl.c --- a/tools/libxl/xl.c Sun Oct 20 11:59:06 2013 -0700 +++ b/tools/libxl/xl.c Sun Oct 20 11:59:16 2013 -0700 @@ -46,6 +46,7 @@ char *default_vifscript = NULL; char *default_bridge = NULL; char *default_gatewaydev = NULL; char *default_vifbackend = NULL; +char *default_remus_netbufscript = NULL; enum output_format default_output_format = OUTPUT_FORMAT_JSON; int claim_mode = 1; @@ -177,6 +178,9 @@ static void parse_global_config(const ch if (!xlu_cfg_get_long (config, "claim_mode", &l, 0)) claim_mode = l; + xlu_cfg_replace_string (config, "remus.default.netbufscript", + &default_remus_netbufscript, 0); + xlu_cfg_destroy(config); } diff -r a8deb9499e9d -r f4eea1e0ac3e tools/libxl/xl.h --- a/tools/libxl/xl.h Sun Oct 20 11:59:06 2013 -0700 +++ b/tools/libxl/xl.h Sun Oct 20 11:59:16 2013 -0700 @@ -152,6 +152,7 @@ extern char *default_vifscript; extern char *default_bridge; extern char *default_gatewaydev; extern char *default_vifbackend; +extern char *default_remus_netbufscript; extern char *blkdev_start; enum output_format { diff -r a8deb9499e9d -r f4eea1e0ac3e tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c Sun Oct 20 11:59:06 2013 -0700 +++ b/tools/libxl/xl_cmdimpl.c Sun Oct 20 11:59:16 2013 -0700 @@ -7111,8 +7111,9 @@ int main_remus(int argc, char **argv) r_info.interval = 200; r_info.blackhole = 0; r_info.compression = 1; - - SWITCH_FOREACH_OPT(opt, "bui:s:e", NULL, "remus", 2) { + r_info.netbuf = 0; + + SWITCH_FOREACH_OPT(opt, "buni:s:N:e", NULL, "remus", 2) { case ''i'': r_info.interval = atoi(optarg); break; @@ -7122,6 +7123,12 @@ int main_remus(int argc, char **argv) case ''u'': r_info.compression = 0; break; + case ''n'': + r_info.netbuf = 1; + break; + case ''N'': + r_info.netbufscript = optarg; + break; case ''s'': ssh_command = optarg; break; @@ -7133,6 +7140,11 @@ int main_remus(int argc, char **argv) domid = find_domain(argv[optind]); host = argv[optind + 1]; + if (r_info.netbuf && !r_info.netbufscript) { + if (default_remus_netbufscript) + r_info.netbufscript = default_remus_netbufscript; + } + if (r_info.blackhole) { send_fd = open("/dev/null", O_RDWR, 0644); if (send_fd < 0) { @@ -7170,13 +7182,10 @@ int main_remus(int argc, char **argv) /* Point of no return */ rc = libxl_domain_remus_start(ctx, &r_info, domid, send_fd, recv_fd, 0); - /* If we are here, it means backup has failed/domain suspend failed. - * Try to resume the domain and exit gracefully. + /* If we are here, it means remus setup/domain suspend/backup has + * failed. Try to resume the domain and exit gracefully. * TODO: Split-Brain check. */ - fprintf(stderr, "remus sender: libxl_domain_suspend failed" - " (rc=%d)\n", rc); - if (rc == ERROR_GUEST_TIMEDOUT) fprintf(stderr, "Failed to suspend domain at primary.\n"); else { diff -r a8deb9499e9d -r f4eea1e0ac3e tools/libxl/xl_cmdtable.c --- a/tools/libxl/xl_cmdtable.c Sun Oct 20 11:59:06 2013 -0700 +++ b/tools/libxl/xl_cmdtable.c Sun Oct 20 11:59:16 2013 -0700 @@ -481,6 +481,9 @@ struct cmd_spec cmd_table[] = { "-i MS Checkpoint domain memory every MS milliseconds (def. 200ms).\n" "-b Replicate memory checkpoints to /dev/null (blackhole)\n" "-u Disable memory checkpoint compression.\n" + "-n Enable network output buffering.\n" + "-N <netbufscript> Use netbufscript to setup network buffering instead of the\n" + " instead of the default (/etc/xen/scripts/remus-netbuf-setup).\n" "-s <sshcommand> Use <sshcommand> instead of ssh. String will be passed\n" " to sh. 
If empty, run <host> instead of \n" " ssh <host> xl migrate-receive -r [-e]\n"
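For completeness, with this series applied network buffering would be enabled along these lines (domain and host names are placeholders; -N is only needed to override the default script):

    xl remus -n -i 200 mydomain backup-host

    # or, with a custom setup script:
    xl remus -n -N /usr/local/etc/my-netbuf-setup -i 200 mydomain backup-host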
Shriram Rajagopalan
2013-Oct-30 23:05 UTC
Re: [PATCH 0 of 5 V3] Remus/Libxl: Network buffering support
On Sun, Oct 20, 2013 at 10:58 PM, Shriram Rajagopalan <rshriram@cs.ubc.ca> wrote:
> This patch series adds support for network buffering in the Remus
> codebase in libxl.
> [...]

ping!
Ian Campbell
2013-Oct-31 20:13 UTC
Re: [PATCH 1 of 5 V3] remus: add libnl3 dependency to autoconf scripts
On Sun, 2013-10-20 at 22:58 -0700, Shriram Rajagopalan wrote:
> diff -r 808aba98084e -r 357655a1598c tools/configure.ac
> --- a/tools/configure.ac Fri Oct 11 18:16:11 2013 -0700
> +++ b/tools/configure.ac Fri Oct 11 18:16:21 2013 -0700
> @@ -219,6 +219,24 @@ AC_SUBST(libiconv)
>  # Checks for header files.
>  AC_CHECK_HEADERS([yajl/yajl_version.h sys/eventfd.h])
>
> +# Check for libnl3 >=3.2.8. If present enable remus network buffering.
> +PKG_CHECK_MODULES(LIBNL3, [libnl-3.0 >= 3.2.8 libnl-route-3.0 >= 3.2.8],
> +    [libnl3_lib="y"], [libnl3_lib="n"])
> +
> +# Check for libnl3 command line utilities

These cli utils are needed on the runtime host but not the build host, right? (For the benefit of the hotplug scripts, I suppose?)

This check only runs on the build host, so it will fail needlessly if the tools aren't there, while at the same time not catching the case where they are missing from the runtime host.

Best bet is to omit this check and simply list them in the README, in the list of optional tools (and reference tools/remus/README for more info if you like).

Ian.
Ian Campbell
2013-Oct-31 20:21 UTC
Re: [PATCH 2 of 5 V3] tools/hotplug: Remus network buffering setup scripts
On Sun, 2013-10-20 at 22:58 -0700, Shriram Rajagopalan wrote:
> +check_libnl_tools() {
> +    if ! command -v nl-qdisc-list > /dev/null 2>&1; then
> +        fatal "Unable to find nl-qdisc-list tool"
> +    fi
> +    if ! command -v nl-qdisc-add > /dev/null 2>&1; then
> +        fatal "Unable to find nl-qdisc-add tool"
> +    fi
> +    if ! command -v nl-qdisc-delete > /dev/null 2>&1; then
> +        fatal "Unable to find nl-qdisc-delete tool"
> +    fi
> +}

This probably suffices in place of the configure check I commented on earlier.

> +add_plug_qdisc() {
> +    local vif=$1
> +    local ifb=$2
> +
> +    nl-qdisc-add --dev="$ifb" --parent root plug >/dev/null 2>&1
> +    if [ $? -ne 0 ]
> +    then
> +        do_without_error tc qdisc del dev "$vif" ingress
> +        fatal "Failed to add plug qdisc to $ifb"
> +    fi
> +
> +    #set ifb buffering limit in bytes. Its okay if this command

... "fails"?

> +
> +case "$command" in
> +    setup)
> +        check_libnl_tools
> +        check_modules
> +
> +        claim_lock "pickifb"
> +        setup_ifb
> +        redirect_vif_traffic "$vifname" "$IFB"
> +        add_plug_qdisc "$vifname" "$IFB"
> +        release_lock "pickifb"
> +
> +        #not using xenstore_write that automatically exits on error
> +        # because we need to cleanup

whitespace inconsistency.

> +        _xenstore_write "$XENBUS_PATH/ifb" "$IFB" || xs_write_failed "$vifname" "$IFB"
> +        ;;
> +    teardown)
> +        : ${IFB?}

Do you mean log debug or something here?

> +        teardown_netbuf "$vifname" "$IFB"
> +        ;;
> +esac
> +
> +log debug "Successful remus-netbuf-setup $command for $vifname, ifb $IFB."
> +
> +if [ "$command" = "setup" ]
> +then
> +    success
> +fi

Why not put this in the case for setup?

Ian.
Ian Campbell
2013-Oct-31 20:28 UTC
Re: [PATCH 3 of 5 V3] tools/libxl: setup/teardown Remus network buffering
On Sun, 2013-10-20 at 22:58 -0700, Shriram Rajagopalan wrote:
> # HG changeset patch
> # User Shriram Rajagopalan <rshriram@cs.ubc.ca>
> # Date 1382295266 25200
> # Node ID d3f088236c550213fc04ed982df47b4771b28d2f
> # Parent 0001a8222a785865b753acf8bcdf97c10c9e3819
> tools/libxl: setup/teardown Remus network buffering
>
> Setup/teardown remus network buffering for a given guest, when
> libxl_domain_remus_start API is invoked.
>
> This patch does the following during setup:
> a) call the hotplug script for each vif to setup its network buffer
>
> b) establish a dedicated remus context containing libnl related
>    state (netlink sockets, qdisc caches, etc.,)
>
> c) Obtain handles to plug qdiscs installed on the IFB devices
>    chosen by the hotplug scripts.
>
> During teardown, the netlink resources are released, followed by
> invocation of hotplug scripts to remove the IFB devices.
>
> Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>

All this async stuff needs Ian J's eye on it really, not mine. It looks vaguely plausible to my untrained eye.

> diff -r 0001a8222a78 -r d3f088236c55 tools/libxl/Makefile
> --- a/tools/libxl/Makefile Sun Oct 20 06:12:52 2013 -0700
> +++ b/tools/libxl/Makefile Sun Oct 20 11:54:26 2013 -0700
> @@ -42,6 +42,13 @@ LIBXL_OBJS-y += libxl_blktap2.o
> else
> LIBXL_OBJS-y += libxl_noblktap2.o
> endif
> +
> +ifeq ($(CONFIG_REMUS_NETBUF),y)
> +LIBXL_OBJS-y += libxl_netbuffer.o

Should the addition of the libnl stuff to LIBS and CFLAGS have been under here too?

> diff -r 0001a8222a78 -r d3f088236c55 tools/libxl/libxl_types.idl
> --- a/tools/libxl/libxl_types.idl Sun Oct 20 06:12:52 2013 -0700
> +++ b/tools/libxl/libxl_types.idl Sun Oct 20 11:54:26 2013 -0700
> @@ -539,6 +539,8 @@ libxl_domain_remus_info = Struct("domain
>     ("interval", integer),
>     ("blackhole", bool),
>     ("compression", bool),
> +    ("netbuf", bool),

I guess libxl_defbool doesn't make sense here.

> +    ("netbufscript", string),

This requires a LIBXL_HAVE #define in libxl.h so users know it is available. One for the overall netbuf feature should do, I think.

Ian.
Ian Campbell
2013-Oct-31 20:31 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks
On Sun, 2013-10-20 at 22:58 -0700, Shriram Rajagopalan wrote:
> # HG changeset patch
> # User Shriram Rajagopalan <rshriram@cs.ubc.ca>
> # Date 1382295546 25200
> # Node ID a8deb9499e9dcce9869025fa1c02cf2e0d58612a
> # Parent d3f088236c550213fc04ed982df47b4771b28d2f
> tools/libxl: Control network buffering in remus callbacks
>
> This patch constitutes the core network buffering logic
> and does the following:
> a) create a new network buffer when the domain is suspended
>    (remus_domain_suspend_callback)
> b) release the previous network buffer pertaining to the
>    committed checkpoint (remus_domain_checkpoint_dm_saved)
>
> Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>

Another one where Ian J's opinion matters more than mine, but it looks ok to me.

Ian.
Ian Campbell
2013-Oct-31 20:38 UTC
Re: [PATCH 5 of 5 V3] tools/xl: Remus - Network buffering cmdline switch
On Sun, 2013-10-20 at 22:58 -0700, Shriram Rajagopalan wrote:
> diff -r a8deb9499e9d -r f4eea1e0ac3e docs/man/xl.pod.1
> --- a/docs/man/xl.pod.1	Sun Oct 20 11:59:06 2013 -0700
> +++ b/docs/man/xl.pod.1	Sun Oct 20 11:59:16 2013 -0700
> @@ -398,8 +398,7 @@ Print huge (!) amount of debug during th
>  Enable Remus HA for domain. By default B<xl> relies on ssh as a transport
>  mechanism between the two hosts.
>
> -N.B: Remus support in xl is still in experimental (proof-of-concept) phase.
> -     There is no support for network or disk buffering at the moment.
> +N.B: There is no support for disk buffering at the moment.
>
>  B<OPTIONS>
>
> @@ -418,6 +417,13 @@ Generally useful for debugging.
>
>  Disable memory checkpoint compression.
>
> +=item B<-n>
> +
> +Enable network output buffering. The default script used to configure
> +network buffering is /etc/xen/scripts/remus-netbuf-setup. If you wish to
> +use a custom script, set the global variable "remus.default.netbufscript"
> +in /etc/xen/xl.conf to point to your script.

No docs for -N ?

> @@ -7133,6 +7140,11 @@ int main_remus(int argc, char **argv)
>      domid = find_domain(argv[optind]);
>      host = argv[optind + 1];
>
> +    if (r_info.netbuf && !r_info.netbufscript) {

Will the code do anything if netbuf==False but netbufscript!=NULL?

> +        if (default_remus_netbufscript)

This if is redundant isn't it, it just avoids assigning NULL to a
variable which is already NULL.

Given the above two comments this whole block could be
    if (!r_info.netbufscript)
        r_info.netbufscript = default...
[...]

> -        fprintf(stderr, "remus sender: libxl_domain_suspend failed"
> -                " (rc=%d)\n", rc);

Was this removed on purpose? It seemed useful looking and wasn't
mentioned in the commit message.

Ian.
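Spelled out, the simplification being suggested is roughly the sketch below, using the r_info and default_remus_netbufscript identifiers from the quoted hunk; whether a script passed without -n should take effect is the separate question raised above:

    /* Sketch for main_remus(): fall back to the xl.conf default only when
     * no script was given on the command line. */
    if (!r_info.netbufscript)
        r_info.netbufscript = default_remus_netbufscript;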
Shriram Rajagopalan
2013-Oct-31 21:06 UTC
Re: [PATCH 2 of 5 V3] tools/hotplug: Remus network buffering setup scripts
On Thu, Oct 31, 2013 at 1:21 PM, Ian Campbell <Ian.Campbell@citrix.com>wrote:> > + > > +case "$command" in > > + setup) > > + check_libnl_tools > > + check_modules > > + > > + claim_lock "pickifb" > > + setup_ifb > > + redirect_vif_traffic "$vifname" "$IFB" > > + add_plug_qdisc "$vifname" "$IFB" > > + release_lock "pickifb" > > + > > + #not using xenstore_write that automatically exits on error > > + # because we need to cleanup > > whitespace inconsistency. > >I seem to get this wrong again and again.. Did you mean the space trailing the second # ? or should the code block inside setup/teardown be indented with two tabs instead of one ?> > + _xenstore_write "$XENBUS_PATH/ifb" "$IFB" || xs_write_failed > "$vifname" "$IFB" > > + ;; > > + teardown) > > + : ${IFB?} > > Do you mean log debug or something here? > >This was to just make sure that the IFB variable was supplied as part of the environment.. Just like the two checks on top of this script.. " : ${vifname?} : ${XENBUS_PATH?} " _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Shriram Rajagopalan
2013-Oct-31 21:47 UTC
Re: [PATCH 5 of 5 V3] tools/xl: Remus - Network buffering cmdline switch
On Thu, Oct 31, 2013 at 1:38 PM, Ian Campbell <Ian.Campbell@citrix.com>wrote:> On Sun, 2013-10-20 at 22:58 -0700, Shriram Rajagopalan wrote: > > diff -r a8deb9499e9d -r f4eea1e0ac3e docs/man/xl.pod.1 > > --- a/docs/man/xl.pod.1 Sun Oct 20 11:59:06 2013 -0700 > > +++ b/docs/man/xl.pod.1 Sun Oct 20 11:59:16 2013 -0700 > > @@ -398,8 +398,7 @@ Print huge (!) amount of debug during th > > Enable Remus HA for domain. By default B<xl> relies on ssh as a > transport > > mechanism between the two hosts. > > > > -N.B: Remus support in xl is still in experimental (proof-of-concept) > phase. > > - There is no support for network or disk buffering at the moment. > > +N.B: There is no support for disk buffering at the moment. > > > > B<OPTIONS> > > > > @@ -418,6 +417,13 @@ Generally useful for debugging. > > > > Disable memory checkpoint compression. > > > > +=item B<-n> > > + > > +Enable network output buffering. The default script used to configure > > +network buffering is /etc/xen/scripts/remus-netbuf-setup. If you wish to > > +use a custom script, set the global variable > "remus.default.netbufscript" > > +in /etc/xen/xl.conf to point to your script. > > No docs for -N ? > > > @@ -7133,6 +7140,11 @@ int main_remus(int argc, char **argv) > > domid = find_domain(argv[optind]); > > host = argv[optind + 1]; > > > > + if (r_info.netbuf && !r_info.netbufscript) { > > Will the code do anything if netbuf==False but netbufscript!=NULL? > > > + if (default_remus_netbufscript) > > This if is redundant isn''t it, it just avoids assigning NULL to a > variable which is already NULL. > > Given the above two comments this whole block could be > if (!r_info.netbufscript) > r_info.netbufscript = default... > [...] > > - fprintf(stderr, "remus sender: libxl_domain_suspend failed" > > - " (rc=%d)\n", rc); > > Was this removed on purpose? It seemed useful looking and wasn''t > mentioned in the commit message. > >Yes, because the error message was not representative. That code block currently looks like this (after removing the fprintf), which is bit more informative. /* If we are here, it means remus setup/domain suspend/backup has * failed. Try to resume the domain and exit gracefully. * TODO: Split-Brain check. */ if (rc == ERROR_GUEST_TIMEDOUT) fprintf(stderr, "Failed to suspend domain at primary.\n"); else { fprintf(stderr, "Remus: Backup failed? resuming domain at primary.\n"); libxl_domain_resume(ctx, domid, 1, 0); }> Ian. > > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Ian Campbell
2013-Oct-31 22:25 UTC
Re: [PATCH 2 of 5 V3] tools/hotplug: Remus network buffering setup scripts
On Thu, 2013-10-31 at 14:06 -0700, Shriram Rajagopalan wrote:
> On Thu, Oct 31, 2013 at 1:21 PM, Ian Campbell
> <Ian.Campbell@citrix.com> wrote:
> > +
> > +case "$command" in
> > +    setup)
> > +        check_libnl_tools
> > +        check_modules
> > +
> > +        claim_lock "pickifb"
> > +        setup_ifb
> > +        redirect_vif_traffic "$vifname" "$IFB"
> > +        add_plug_qdisc "$vifname" "$IFB"
> > +        release_lock "pickifb"
> > +
> > +        #not using xenstore_write that automatically exits on error
> > +        # because we need to cleanup
> >
> > whitespace inconsistency.
>
> I seem to get this wrong again and again..
> Did you mean the space trailing the second # ? or should the code
> block inside setup/teardown be indented with two tabs instead of one ?

I meant "#foo" (1st line) vs "# foo" (2nd line) which just caught my eye.
They should be the same. More normal would be a space after the # I think.

I don't think we have a particularly strict coding style for shell
scripts, but code being consistent with itself is a good start.

> > +        _xenstore_write "$XENBUS_PATH/ifb" "$IFB" || xs_write_failed "$vifname" "$IFB"
> > +        ;;
> > +    teardown)
> > +        : ${IFB?}
> >
> > Do you mean log debug or something here?
>
> This was to just make sure that the IFB variable was supplied as part
> of the environment.. Just like the two checks on top of this script..
> "
> : ${vifname?}
> : ${XENBUS_PATH?}
> "

Do these result in useful error reporting to the end user? Bearing in
mind that the end user might be using libxl but not xl and therefore not
see stdout/err (which might be going into some daemon's log file).

Also, either I'm missing it or the bash manpage only documents
${parameter:?} and not ${parameter?}. Are both actually valid?

Ian.
Ian Campbell
2013-Oct-31 22:29 UTC
Re: [PATCH 5 of 5 V3] tools/xl: Remus - Network buffering cmdline switch
On Thu, 2013-10-31 at 14:47 -0700, Shriram Rajagopalan wrote:
> Yes, because the error message was not representative. That code block
> currently looks like this (after removing the fprintf), which is bit
> more informative.

Makes sense, thanks.

>     /* If we are here, it means remus setup/domain suspend/backup has
>      * failed. Try to resume the domain and exit gracefully.
>      * TODO: Split-Brain check.
>      */
>     if (rc == ERROR_GUEST_TIMEDOUT)
>         fprintf(stderr, "Failed to suspend domain at primary.\n");
>     else {
>         fprintf(stderr, "Remus: Backup failed? resuming domain at primary.\n");
>         libxl_domain_resume(ctx, domid, 1, 0);
>     }

Ian.
Ian Jackson
2013-Nov-01 18:28 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks
Shriram Rajagopalan writes ("[PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks"):
> tools/libxl: Control network buffering in remus callbacks
...
> static int libxl__remus_domain_suspend_callback(void *data)
> {
> -    /* REMUS TODO: Issue disk and network checkpoint reqs. */
> -    return libxl__domain_suspend_common_callback(data);
> +    /* REMUS TODO: Issue disk checkpoint reqs. */
> +    libxl__save_helper_state *shs = data;
> +    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
> +    libxl__remus_ctx *remus_ctx = dss->remus_ctx;
> +    bool is_suspended;
> +    STATE_AO_GC(dss->ao);
> +
> +    is_suspended = !!libxl__domain_suspend_common_callback(data);

I think this function would be less confusing if the return value
variable was called "ok" or something similar. It's really just an
error flag.

Also AFAICT the !! has no function here. The return value is
essentially a boolean.

> +    if (!remus_ctx->netbuf_ctx) return is_suspended;
> +
> +    if (is_suspended) {
> +        if (libxl__remus_netbuf_start_new_epoch(gc, dss->domid,
> +                                                remus_ctx))
> +            return !is_suspended;

This construction is rather odd. Why return !is_suspended when you
know that !!is_suspended ?

> @@ -1300,11 +1316,42 @@ static void libxl__remus_domain_checkpoi
>  static void remus_checkpoint_dm_saved(libxl__egc *egc,
>                                        libxl__domain_suspend_state *dss, int rc)
>  {
> -    /* REMUS TODO: Wait for disk and memory ack, release network buffer */
> -    /* REMUS TODO: make this asynchronous */
> -    assert(!rc); /* REMUS TODO handle this error properly */
> -    usleep(dss->remus_ctx->interval * 1000);
> -    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1);
> +    /*
> +     * REMUS TODO: Wait for disk and explicit memory ack (through restore
> +     * callback from remote) before releasing network buffer.
> +     */
> +    libxl__remus_ctx *remus_ctx = dss->remus_ctx;
> +    struct timespec epoch;
> +    int do_next_iter = 0;

Again, this variable is probably best called "ok".

> +    epoch.tv_sec = remus_ctx->interval / 1000; /* interval is in ms */
> +    epoch.tv_nsec = remus_ctx->interval * 1000L * 1000L;
> +    nanosleep(&epoch, 0);

This is no good, I'm afraid. You should be using a libxl event loop
timer. (Also why are you changing your existing usleep to this new
nanosleep which seems functionally very similar?)

Ian.
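An editorial aside on the quoted interval conversion: however the wait ends up being implemented, a millisecond interval split into a struct timespec should carry only the sub-second remainder in tv_nsec; the quoted hunk stores the whole interval there, which double-counts the seconds and exceeds the valid tv_nsec range once the interval reaches 1000ms. A minimal sketch of the conversion, assuming only that the interval is in milliseconds:

    #include <time.h>

    /* Sketch: split a checkpoint interval given in milliseconds into a
     * struct timespec suitable for nanosleep(). */
    static struct timespec interval_to_timespec(int interval_ms)
    {
        struct timespec ts;
        ts.tv_sec  = interval_ms / 1000;
        ts.tv_nsec = (long)(interval_ms % 1000) * 1000L * 1000L;
        return ts;
    }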
Shriram Rajagopalan
2013-Nov-01 19:57 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks
On Fri, Nov 1, 2013 at 11:28 AM, Ian Jackson <Ian.Jackson@eu.citrix.com>wrote:> Shriram Rajagopalan writes ("[PATCH 4 of 5 V3] tools/libxl: Control > network buffering in remus callbacks"): > > tools/libxl: Control network buffering in remus callbacks > ... > > static int libxl__remus_domain_suspend_callback(void *data) > > { > > - /* REMUS TODO: Issue disk and network checkpoint reqs. */ > > - return libxl__domain_suspend_common_callback(data); > > + /* REMUS TODO: Issue disk checkpoint reqs. */ > > + libxl__save_helper_state *shs = data; > > + libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); > > + libxl__remus_ctx *remus_ctx = dss->remus_ctx; > > + bool is_suspended; > > + STATE_AO_GC(dss->ao); > > + > > + is_suspended = !!libxl__domain_suspend_common_callback(data); > > I think this function would be less confusing if the return value > variable was called "ok" or something similar. It''s really just an > error flag. > > Also AFAICT the !! has no function here. The return value is > essentially a boolean. > > > + if (!remus_ctx->netbuf_ctx) return is_suspended; > > + > > + if (is_suspended) { > > + if (libxl__remus_netbuf_start_new_epoch(gc, dss->domid, > > + remus_ctx)) > > + return !is_suspended; > > This construction is rather odd. Why return !is_suspended when you > know that !!is_suspended ? > > > @@ -1300,11 +1316,42 @@ static void libxl__remus_domain_checkpoi > > static void remus_checkpoint_dm_saved(libxl__egc *egc, > > libxl__domain_suspend_state *dss, > int rc) > > { > > - /* REMUS TODO: Wait for disk and memory ack, release network buffer > */ > > - /* REMUS TODO: make this asynchronous */ > > - assert(!rc); /* REMUS TODO handle this error properly */ > > - usleep(dss->remus_ctx->interval * 1000); > > - libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1); > > + /* > > + * REMUS TODO: Wait for disk and explicit memory ack (through > restore > > + * callback from remote) before releasing network buffer. > > + */ > > + libxl__remus_ctx *remus_ctx = dss->remus_ctx; > > + struct timespec epoch; > > + int do_next_iter = 0; > > Again, this variable is probably best called "ok". > > > + epoch.tv_sec = remus_ctx->interval / 1000; /* interval is in ms */ > > + epoch.tv_nsec = remus_ctx->interval * 1000L * 1000L; > > + nanosleep(&epoch, 0); > > This is no good, I''m afraid. You should be using a libxl event loop > timer. (Also why are you changing your existing usleep to this new > nanosleep which seems functionally very similar?) > > Ian. > >Nanosleep was more of my personal preference, just as the ''!!'' above was Andrew''s suggestion, due to the fact that libxl__domain_suspend_common_callback() does not strictly return boolean. But I agree that renaming that variable makes more sense. As far as using the event loop timers, I couldn''t find anything suitable in libxl_event.h/c that I could use instead of usleep/nanosleep. Specifically, they were all asynchronous, while the caller of these functions is libxc''s xc_domain_save infinite loop (synchronous). _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Shriram Rajagopalan
2013-Nov-01 20:04 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks
On Fri, Nov 1, 2013 at 11:28 AM, Ian Jackson <Ian.Jackson@eu.citrix.com>wrote:> Shriram Rajagopalan writes ("[PATCH 4 of 5 V3] tools/libxl: Control > network buffering in remus callbacks"): > > tools/libxl: Control network buffering in remus callbacks > ... > > static int libxl__remus_domain_suspend_callback(void *data) > > { > > - /* REMUS TODO: Issue disk and network checkpoint reqs. */ > > - return libxl__domain_suspend_common_callback(data); > > + /* REMUS TODO: Issue disk checkpoint reqs. */ > > + libxl__save_helper_state *shs = data; > > + libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); > > + libxl__remus_ctx *remus_ctx = dss->remus_ctx; > > + bool is_suspended; > > + STATE_AO_GC(dss->ao); > > + > > + is_suspended = !!libxl__domain_suspend_common_callback(data); > > I think this function would be less confusing if the return value > variable was called "ok" or something similar. It''s really just an > error flag. > > Also AFAICT the !! has no function here. The return value is > essentially a boolean. > > > + if (!remus_ctx->netbuf_ctx) return is_suspended; > > + > > + if (is_suspended) { > > + if (libxl__remus_netbuf_start_new_epoch(gc, dss->domid, > > + remus_ctx)) > > + return !is_suspended; > > This construction is rather odd. Why return !is_suspended when you > know that !!is_suspended ? > >If a new network buffer cannot be created (for the next epoch), then return an error. This ends up terminating the checkpointing process.> > @@ -1300,11 +1316,42 @@ static void libxl__remus_domain_checkpoi > > static void remus_checkpoint_dm_saved(libxl__egc *egc, > > libxl__domain_suspend_state *dss, > int rc) > > { > > - /* REMUS TODO: Wait for disk and memory ack, release network buffer > */ > > - /* REMUS TODO: make this asynchronous */ > > - assert(!rc); /* REMUS TODO handle this error properly */ > > - usleep(dss->remus_ctx->interval * 1000); > > - libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1); > > + /* > > + * REMUS TODO: Wait for disk and explicit memory ack (through > restore > > + * callback from remote) before releasing network buffer. > > + */ > > + libxl__remus_ctx *remus_ctx = dss->remus_ctx; > > + struct timespec epoch; > > + int do_next_iter = 0; > > Again, this variable is probably best called "ok". > > > + epoch.tv_sec = remus_ctx->interval / 1000; /* interval is in ms */ > > + epoch.tv_nsec = remus_ctx->interval * 1000L * 1000L; > > + nanosleep(&epoch, 0); > > This is no good, I''m afraid. You should be using a libxl event loop > timer. (Also why are you changing your existing usleep to this new > nanosleep which seems functionally very similar?) > > Ian. > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Ian Jackson
2013-Nov-04 12:12 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]
Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks"):
> Nanosleep was more of my personal preference,

I don't think that's a good enough reason for the churn, but as I say
this really ought to be replaced with use of a timeout event.

> just as the '!!' above was Andrew's suggestion, due to the fact that
> libxl__domain_suspend_common_callback() does not strictly return
> boolean.

Are you looking at a different tree to me ? I have checked the return
statements in libxl__domain_suspend_common_callback and they are all
"return 0" or "return 1".

> But I agree that renaming that variable makes more sense.

Good :-). Thanks.

> As far as using the event loop timers, I couldn't find anything
> suitable in libxl_event.h/c that I could use instead of
> usleep/nanosleep. Specifically, they were all asynchronous, while
> the caller of these functions is libxc's xc_domain_save infinite
> loop (synchronous).

No, you need to make it an asynchronous call.

The libxl save helper machinery arranges to run xc_domain_save in a
subprocess so that the main program can continue asynchronously.

It's true that the suspend checkpoint is currently synchronous. It
needs to be made asynchronous by telling the stub generator
(libxl_save_msgs_gen.pl) to generate an asynchronous binding, like it
does for switch_qemu_logdirty.

Perhaps it would be helpful if I provided a pre-patch to make that
change for you.

> > +    if (!remus_ctx->netbuf_ctx) return is_suspended;
> > +
> > +    if (is_suspended) {
> > +        if (libxl__remus_netbuf_start_new_epoch(gc, dss->domid,
> > +                                                remus_ctx))
> > +            return !is_suspended;
> >
> > This construction is rather odd. Why return !is_suspended when you
> > know that !!is_suspended ?
...
> If a new network buffer cannot be created (for the next epoch), then
> return an error. This ends up terminating the checkpointing process.

Yes. But why not "return 0" ?

Thanks,
Ian.
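Putting the two review points together (a plain success flag, and returning 0 on any failure), the suspend callback could be sketched as below. The identifiers come from the quoted hunks, but the function name and exact shape here are illustrative, not the code that was actually posted:

    /* Sketch of the suspend callback with a plain success flag; returning 0
     * on any failure terminates the checkpoint stream. */
    static int remus_suspend_callback_sketch(void *data)
    {
        libxl__save_helper_state *shs = data;
        libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
        libxl__remus_ctx *remus_ctx = dss->remus_ctx;
        STATE_AO_GC(dss->ao);

        int ok = libxl__domain_suspend_common_callback(data);
        if (!ok) return 0;

        /* Only touch the network buffer when netbuf support is active. */
        if (remus_ctx->netbuf_ctx &&
            libxl__remus_netbuf_start_new_epoch(gc, dss->domid, remus_ctx))
            return 0;   /* could not open a buffer for the next epoch */

        return 1;
    }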
Shriram Rajagopalan
2013-Nov-04 15:17 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]
On Mon, Nov 4, 2013 at 6:12 AM, Ian Jackson <Ian.Jackson@eu.citrix.com>wrote:> Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control > network buffering in remus callbacks"): > > Nanosleep was more of my personal preference, > > I don''t think that''s a good enough reason for the churn, but as I say > this really ought to be replaced with use of a timeout event. > >Fair enough. My question is what is the overhead of setting up, firing and tearing down a timeout event using the event gen framework, if I wish to checkpoint the VM, say every 20ms ? If you happen to have the numbers off the top of your head, it would help. Or if you are sure that the system can easily handle this rate of events with very little overhead (<0.2ms per event)> > As far as using the event loop timers, I couldn''t find anything > > suitable in libxl_event.h/c that I could use instead of > > usleep/nanosleep. Specifically, they were all asynchronous, while > > the caller of these functions is libxc''s xc_domain_save infinite > > loop (synchronous). > > No, you need to make it an asynchronous call. > > The libxl save helper machinery arranges to run xc_domain_save in a > subprocess so that the main program can continue asynchronously. > > It''s true that the suspend checkpoint is currently synchronous. It > needs to be made synchronous by telling the stub generator > (libxl_save_msgs_gen.pl) to generate an asynchronous binding, like it > does for switch_qemu_logdirty. > > Perhaps it would be helpful if I provided a pre-patch to make that > change for you. > >Yes, that would be pretty helpful. Thanks! shriram _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Ian Campbell
2013-Nov-04 15:32 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]
On Mon, 2013-11-04 at 09:17 -0600, Shriram Rajagopalan wrote:
> On Mon, Nov 4, 2013 at 6:12 AM, Ian Jackson
> <Ian.Jackson@eu.citrix.com> wrote:
> > Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl:
> > Control network buffering in remus callbacks"):
> > > Nanosleep was more of my personal preference,
> >
> > I don't think that's a good enough reason for the churn, but as I say
> > this really ought to be replaced with use of a timeout event.
>
> Fair enough. My question is what is the overhead of setting up, firing
> and tearing down a timeout event using the event gen framework, if I
> wish to checkpoint the VM, say every 20ms ?
> If you happen to have the numbers off the top of your head, it would
> help. Or if you are sure that the system can easily handle this rate of
> events with very little overhead (<0.2ms per event)

Regardless of the answer here, would it make sense to do some/all of the
checkpoint processing in the helper subprocess anyway and only signal
the eventual failover up to the libxl process?

This async op is potentially quite long running I think compared to a
normal one i.e. if the guest doesn't die it is expected that the ao
lives "forever". Since the associated gc's persist until the ao ends
this might end up accumulating lots of allocations? Ian had a similar
concern about Roger's hotplug daemon series and suggested creating a per
iteration gc or something.

Ian.
Ian Jackson
2013-Nov-04 16:06 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]
Ian Campbell writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]"):
> Regardless of the answer here, would it make sense to do some/all of the
> checkpoint processing in the helper subprocess anyway and only signal
> the eventual failover up to the libxl process?

It might do. Mind you, the code in libxc is tangled enough as it is
and is due for a rewrite. Perhaps this could be done in the helper
executable, although there isn't currently any way to easily
intersperse code in there.

> This async op is potentially quite long running I think compared to a
> normal one i.e. if the guest doesn't die it is expected that the ao
> lives "forever". Since the associated gc's persist until the ao ends
> this might end up accumulating lots of allocations? Ian had a similar
> concern about Roger's hotplug daemon series and suggested creating a per
> iteration gc or something.

Yes, this is indeed a problem. Well spotted.

Which of the xc_domain_save (and _restore) callbacks are called each
remus iteration ?

I think introducing a per-iteration gc here is going to involve taking
some care, since we need to be sure not to put
per-iteration-gc-allocated objects into data structures which are used
by subsequent iterations.

Shriram writes:
> Fair enough. My question is what is the overhead of setting up, firing
> and tearing down a timeout event using the event gen framework, if I
> wish to checkpoint the VM, say every 20ms ?

The ultimate cost of going back into the event loop to wait for a
timeout will depend on what else the process is doing. If the process
is doing nothing else, it's about two calls to gettimeofday and one to
poll. Plus a bit of in-process computation, but that's going to be
swamped by system call overhead.

Having said that, libxl is not performance-optimised. Indeed the
callback mechanism involves context switching, and IPC, between the
save/restore helper and libxl proper. Probably not too much to be
doing every 20ms for a single domain, but if you have a lot of these
it's going to end up taking a lot of dom0 cpu etc.

I assume you're not doing this for HVM domains, which involve saving
the qemu state each time too.

Ian.
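For concreteness, a rough sketch of the timeout-event approach in remus_checkpoint_dm_saved, using libxl's internal timer events. The libxl__ev_time_* signatures are recalled from the libxl_internal.h of that era and the checkpoint_timeout field on the suspend state is hypothetical, so treat this as an assumption rather than a description of any posted patch:

    /* Hypothetical addition to libxl__domain_suspend_state:
     *     libxl__ev_time checkpoint_timeout;
     */

    static void checkpoint_interval_done(libxl__egc *egc, libxl__ev_time *ev,
                                         const struct timeval *requested_abs)
    {
        libxl__domain_suspend_state *dss =
            CONTAINER_OF(ev, *dss, checkpoint_timeout);
        STATE_AO_GC(dss->ao);

        libxl__ev_time_deregister(gc, ev);
        /* Kick the save helper into the next checkpoint iteration. */
        libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1);
    }

    static void remus_checkpoint_dm_saved(libxl__egc *egc,
                                          libxl__domain_suspend_state *dss,
                                          int rc)
    {
        STATE_AO_GC(dss->ao);

        /* ... release the network buffer of the committed checkpoint ... */

        /* Arm a one-shot timer for the configured interval (in ms) and
         * return to the event loop instead of sleeping in the callback. */
        if (libxl__ev_time_register_rel(gc, &dss->checkpoint_timeout,
                                        checkpoint_interval_done,
                                        dss->remus_ctx->interval))
            libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
    }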
Ian Jackson
2013-Nov-04 16:40 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]
Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]"):> On Mon, Nov 4, 2013 at 6:12 AM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote> Perhaps it would be helpful if I provided a pre-patch to make that > change for you. > > Yes, that would be pretty helpful. Thanks!See below. I have compiled this but not tested it. It should be safe but I can''t rule out having perpetrated some howler of a bug. Please let me know if it doesn''t work. Ian Jackson writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]"):> Shriram writes: > > Fair enough. My question is what is the overhead of setting up, firing > > and tearing down a timeout event using the event gen framework, if I > > wish to checkpoint the VM, say every 20ms ? > > The ultimate cost of going back into the event loop to wait for a > timeout will depend on what else the process is doing. If the process > is doing nothing else, it''s about two calls to gettimeofday and one to > poll. Plus a bit of in-process computation, but that''s going to be > swamped by system call overhead. > > Having said that, libxl is not performance-optimised. Indeed the > callback mechanism involves context switching, and IPC, between the > save/restore helper and libxl proper. Probably not too much to be > doing every 20ms for a single domain, but if you have a lot of these > it''s going to end up taking a lot of dom0 cpu etc. > > I assume you''re not doing this for HVM domains, which involve saving > the qemu state each time too.I guess another way to look at this is that changing this one timeout from a synchronous to asynchronous version is not going to make any noticeable difference to the performance of the whole thing. You''re already using all of the asynchronous save/restore helper machinery and the libxl event loop. So if the performance of your V3 patches is acceptable, this will be fine too. Ian. From 46b08918302a8c1d2e470b7af7045557f73afde9 Mon Sep 17 00:00:00 2001 From: Ian Jackson <ian.jackson@eu.citrix.com> Date: Mon, 4 Nov 2013 16:27:53 +0000 Subject: [PATCH] libxl: make libxl__domain_suspend_callback be asynchronous Mark the suspend callback as asynchronous in the helper stub generator (libxl_save_msgs_gen.pl). Remus is going to want to provide an asynchronous version of this function. libxl__domain_suspend_common_callback, the common synchronous core, which used to be provided directly as the callback function for the helper machinery, becomes libxl__domain_suspend_callback_common. It can now take a typesafe parameter. For now, provide two very similar asynchronous wrappers for it. Each is a simple function which contains only boilerplate, calls the common synchronous core, and returns the asynchronous response. Essentially, we have just moved (in the case of suspend callbacks) the call site of libxl__srm_callout_callback_complete. It was in the switch statement in the autogenerated _libxl_save_msgs_callout.c, and is now in the handwritten libxl_dom.c. No functional change. 
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca> Cc: Ian Campbell <ian.campbell@citrix.com> --- tools/libxl/libxl_dom.c | 25 +++++++++++++++++++------ tools/libxl/libxl_internal.h | 2 +- tools/libxl/libxl_save_msgs_gen.pl | 2 +- 3 files changed, 21 insertions(+), 8 deletions(-) diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index 1812bdc..b5cde42 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -1003,10 +1003,8 @@ int libxl__domain_resume_device_model(libxl__gc *gc, uint32_t domid) return 0; } -int libxl__domain_suspend_common_callback(void *user) +int libxl__domain_suspend_callback_common(libxl__domain_suspend_state *dss) { - libxl__save_helper_state *shs = user; - libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); STATE_AO_GC(dss->ao); unsigned long hvm_s_state = 0, hvm_pvdrv = 0; int ret; @@ -1225,12 +1223,27 @@ int libxl__toolstack_save(uint32_t domid, uint8_t **buf, return 0; } +static void libxl__domain_suspend_callback(void *data) +{ + libxl__save_helper_state *shs = data; + libxl__egc *egc = shs->egc; + libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); + + int ok = libxl__domain_suspend_callback_common(dss); + libxl__xc_domain_saverestore_async_callback_done(egc, shs, ok); +} + /*----- remus callbacks -----*/ -static int libxl__remus_domain_suspend_callback(void *data) +static void libxl__remus_domain_suspend_callback(void *data) { + libxl__save_helper_state *shs = data; + libxl__egc *egc = shs->egc; + libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); + /* REMUS TODO: Issue disk and network checkpoint reqs. */ - return libxl__domain_suspend_common_callback(data); + int ok = libxl__domain_suspend_callback_common(dss); + libxl__xc_domain_saverestore_async_callback_done(egc, shs, ok); } static int libxl__remus_domain_resume_callback(void *data) @@ -1354,7 +1367,7 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss) callbacks->postcopy = libxl__remus_domain_resume_callback; callbacks->checkpoint = libxl__remus_domain_checkpoint_callback; } else - callbacks->suspend = libxl__domain_suspend_common_callback; + callbacks->suspend = libxl__domain_suspend_callback; callbacks->switch_qemu_logdirty = libxl__domain_suspend_common_switch_qemu_logdirty; dss->shs.callbacks.save.toolstack_save = libxl__toolstack_save; diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 4f92522..79eb8f8 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2551,7 +2551,7 @@ _hidden void libxl__xc_domain_save_done(libxl__egc*, void *dss_void, void libxl__xc_domain_saverestore_async_callback_done(libxl__egc *egc, libxl__save_helper_state *shs, int return_value); -_hidden int libxl__domain_suspend_common_callback(void *data); +_hidden int libxl__domain_suspend_callback_common(libxl__domain_suspend_state*); _hidden void libxl__domain_suspend_common_switch_qemu_logdirty (int domid, unsigned int enable, void *data); _hidden int libxl__toolstack_save(uint32_t domid, uint8_t **buf, diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl index ee126c7..3c6bd57 100755 --- a/tools/libxl/libxl_save_msgs_gen.pl +++ b/tools/libxl/libxl_save_msgs_gen.pl @@ -23,7 +23,7 @@ our @msgs = ( STRING doing_what), ''unsigned long'', ''done'', ''unsigned long'', ''total''] ], - [ 3, ''scxW'', "suspend", [] ], + [ 3, ''scxA'', "suspend", [] ], [ 4, ''scxW'', "postcopy", [] ], [ 5, ''scxA'', "checkpoint", 
[] ], [ 6, ''scxA'', "switch_qemu_logdirty", [qw(int domid -- 1.7.10.4
Ian Campbell
2013-Nov-04 16:45 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]
On Mon, 2013-11-04 at 16:06 +0000, Ian Jackson wrote:
> Ian Campbell writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]"):
> > Regardless of the answer here, would it make sense to do some/all of the
> > checkpoint processing in the helper subprocess anyway and only signal
> > the eventual failover up to the libxl process?
>
> It might do. Mind you, the code in libxc is tangled enough as it is
> and is due for a rewrite. Perhaps this could be done in the helper
> executable,

That's where I meant.

> although there isn't currently any way to easily
> intersperse code in there.

A simple matter of programming surely ;-)

Ian.
Shriram Rajagopalan
2013-Nov-04 16:47 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]
On Mon, Nov 4, 2013 at 10:06 AM, Ian Jackson <Ian.Jackson@eu.citrix.com>wrote:> Ian Campbell writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network > buffering in remus callbacks [and 1 more messages]"): > > Regardless of the answer here, would it make sense to do some/all of the > > checkpoint processing in the helper subprocess anyway and only signal > > the eventual failover up to the libxl process? > > It might do. Mind you, the code in libxc is tangled enough as it is > and is due for a rewrite. Perhaps this could be done in the helper > executable, although there isn''t currently any way to easily > intersperse code in there. > > > This async op is potentially quite long running I think compared to a > > normal one i.e. if the guest doesn''t die it is expected that the ao > > lives "forever". Since the associated gc''s persist until the ao ends > > this might end up accumulating lots of allocations? Ian had a similar > > concern about Roger''s hotplug daemon series and suggested creating a per > > iteration gc or something. >Yes, this is indeed a problem. Well spotted.>Which of the xc_domain_save (and _restore) callbacks are called each> remus iteration ? > >Almost all of them on the xc_domain_save side. (suspend, resume, save_qemu state, checkpoint). xc_domain_restore doesn''t have any callbacks AFAIK. And remus as of now does not have a component on the restore side. It piggybacks on live migration''s restore framework.> I think introducing a per-iteration gc here is going to involve taking > some care, since we need to be sure not to put > per-iteraton-gc-allocated objects into data structures which are used > by subsequent iterations. > >FWIW, the remus related code that executes per iteration does not allocate anything. All allocations happen only during setup and I was under the impression that no other allocations are taking place everytime xc_domain_save calls back into libxl. However, it may be possible that other parts of the AO machinery (and there are a lot of them) are allocating stuff per iteration. And if that is the case, it could easily lead to OOMs since Remus technically runs as long as the domain lives. Shriram writes:> > Fair enough. My question is what is the overhead of setting up, firing > > and tearing down a timeout event using the event gen framework, if I > > wish to checkpoint the VM, say every 20ms ? > > The ultimate cost of going back into the event loop to wait for a > timeout will depend on what else the process is doing. If the process > is doing nothing else, it''s about two calls to gettimeofday and one to > poll. Plus a bit of in-process computation, but that''s going to be > swamped by system call overhead. > > Having said that, libxl is not performance-optimised. Indeed the > callback mechanism involves context switching, and IPC, between the > save/restore helper and libxl proper. Probably not too much to be > doing every 20ms for a single domain, but if you have a lot of these > it''s going to end up taking a lot of dom0 cpu etc. > >Yes and that is a problem. Xend+Remus avoided this by linking the libcheckpoint library that interfaced with both the python & libxc code.> I assume you''re not doing this for HVM domains, which involve saving > the qemu state each time too. > >It includes HVM domains too. Although in that case, xenstore based suspend takes about 5ms. So the checkpoint interval is typically 50ms or so. If there is a latency sensitive task running inside the VM, lower checkpoint interval leads to better performance. 
_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Ian Jackson
2013-Nov-04 17:01 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]
Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]"):
> On Mon, Nov 4, 2013 at 10:06 AM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> > Which of the xc_domain_save (and _restore) callbacks are called each
> > remus iteration ?
>
> Almost all of them on the xc_domain_save side. (suspend, resume,
> save_qemu state, checkpoint).

Right.

> xc_domain_restore doesn't have any
> callbacks AFAIK. And remus as of now does not have a component on
> the restore side. It piggybacks on live migration's restore
> framework.

But the libxl memory management in the restore code is currently
written to assume a finite lifetime for the ao. So I think this needs
to be improved.

Perhaps all the suspend/restore callbacks should each get one of the
nested ao type things that Roger needs for his driver domain daemon.

> FWIW, the remus related code that executes per iteration does not
> allocate anything. All allocations happen only during setup and I
> was under the impression that no other allocations are taking place
> everytime xc_domain_save calls back into libxl.

If this is true, then good, because we don't need to do anything, but
there is a lot of code there and I would want to check.

> However, it may be possible that other parts of the AO machinery
> (and there are a lot of them) are allocating stuff per
> iteration. And if that is the case, it could easily lead to OOMs
> since Remus technically runs as long as the domain lives.

The ao and event machinery doesn't do much allocation itself.

> > Having said that, libxl is not performance-optimised. Indeed the
> > callback mechanism involves context switching, and IPC, between the
> > save/restore helper and libxl proper. Probably not too much to be
> > doing every 20ms for a single domain, but if you have a lot of these
> > it's going to end up taking a lot of dom0 cpu etc.
>
> Yes and that is a problem. Xend+Remus avoided this by linking
> the libcheckpoint library that interfaced with both the python & libxc code.

Have you observed whether the performance is acceptable with your V3
patches ?

> > I assume you're not doing this for HVM domains, which involve saving
> > the qemu state each time too.
...
> It includes HVM domains too. Although in that case, xenstore based suspend
> takes about 5ms. So the checkpoint interval is typically 50ms or so.

Right.

> If there is a latency sensitive task running inside
> the VM, lower checkpoint interval leads to better performance.

Yes.

Ian.
Shriram Rajagopalan
2013-Nov-04 17:23 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]
On Mon, Nov 4, 2013 at 11:01 AM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> > Having said that, libxl is not performance-optimised. Indeed the
> > callback mechanism involves context switching, and IPC, between the
> > save/restore helper and libxl proper. Probably not too much to be
> > doing every 20ms for a single domain, but if you have a lot of these
> > it's going to end up taking a lot of dom0 cpu etc.
>
> > Yes and that is a problem. Xend+Remus avoided this by linking
> > the libcheckpoint library that interfaced with both the python & libxc code.
>
> Have you observed whether the performance is acceptable with your V3
> patches ?

Frankly I don't know. At the moment, I am trying to make sure that the
libxl-remus implementation is correct before I attempt to make it fast.

With/without these patches, there is a huge overhead per checkpoint. I
cannot get to a 20ms checkpoint interval. Time to suspend/resume a domain
is on the order of tens of milliseconds even if it was a PV domain with
fast suspend support -- where it used to take under 1ms to suspend,
checkpoint and resume an idle PV domain on Xend+Remus.

This was one of the issues I raised a long time back. And this is
currently the second element in my queue, the first being to get the
patches mainline, then slowly tighten up relevant code for performance.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Ian Jackson
2013-Nov-04 17:33 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]
Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]"):
> Frankly I don't know. At the moment, I am trying to make sure that
> the libxl-remus implementation is correct before I attempt to make
> it fast.
>
> With/without these patches, there is a huge overhead per
> checkpoint. I cannot get to a 20ms checkpoint interval. Time to
> suspend/resume a domain is on the order of tens of milliseconds even
> if it was a PV domain with fast suspend support -- where it used to
> take under 1ms to suspend, checkpoint and resume an idle PV domain
> on Xend+Remus.

Oh dear.

> This was one of the issues I raised a long time back. And this is
> currently the second element in my queue, the first being to get the
> patches mainline, then slowly tighten up relevant code for
> performance.

I think this is probably a bad approach. Under the circumstances, it
seems like you really need to move the relevant code lower in the call
stack. In particular, you want everything to be handled synchronously
in the same process as is doing the save/restore, at least for PV
domains. That's likely to involve substantially rewriting it.

Ian.
Shriram Rajagopalan
2013-Nov-11 17:56 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]
On Mon, Nov 4, 2013 at 10:40 AM, Ian Jackson <Ian.Jackson@eu.citrix.com>wrote:> Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control > network buffering in remus callbacks [and 1 more messages]"): > > On Mon, Nov 4, 2013 at 6:12 AM, Ian Jackson <Ian.Jackson@eu.citrix.com> > wrote> Perhaps it would be helpful if I provided a pre-patch to make > that > > change for you. > > > > Yes, that would be pretty helpful. Thanks! > > See below. I have compiled this but not tested it. It should be safe > but I can''t rule out having perpetrated some howler of a bug. Please > let me know if it doesn''t work. > > > Ian Jackson writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network > buffering in remus callbacks [and 1 more messages]"): > > Shriram writes: > > > Fair enough. My question is what is the overhead of setting up, firing > > > and tearing down a timeout event using the event gen framework, if I > > > wish to checkpoint the VM, say every 20ms ? > > > > The ultimate cost of going back into the event loop to wait for a > > timeout will depend on what else the process is doing. If the process > > is doing nothing else, it''s about two calls to gettimeofday and one to > > poll. Plus a bit of in-process computation, but that''s going to be > > swamped by system call overhead. > > > > Having said that, libxl is not performance-optimised. Indeed the > > callback mechanism involves context switching, and IPC, between the > > save/restore helper and libxl proper. Probably not too much to be > > doing every 20ms for a single domain, but if you have a lot of these > > it''s going to end up taking a lot of dom0 cpu etc. > > > > I assume you''re not doing this for HVM domains, which involve saving > > the qemu state each time too. > > I guess another way to look at this is that changing this one timeout > from a synchronous to asynchronous version is not going to make any > noticeable difference to the performance of the whole thing. You''re > already using all of the asynchronous save/restore helper machinery > and the libxl event loop. > > So if the performance of your V3 patches is acceptable, this will be > fine too. > > Ian. > > > >From 46b08918302a8c1d2e470b7af7045557f73afde9 Mon Sep 17 00:00:00 2001 > From: Ian Jackson <ian.jackson@eu.citrix.com> > Date: Mon, 4 Nov 2013 16:27:53 +0000 > Subject: [PATCH] libxl: make libxl__domain_suspend_callback be asynchronous > > Mark the suspend callback as asynchronous in the helper stub generator > (libxl_save_msgs_gen.pl). Remus is going to want to provide an > asynchronous version of this function. > > libxl__domain_suspend_common_callback, the common synchronous core, > which used to be provided directly as the callback function for the > helper machinery, becomes libxl__domain_suspend_callback_common. It > can now take a typesafe parameter. > > For now, provide two very similar asynchronous wrappers for it. Each > is a simple function which contains only boilerplate, calls the common > synchronous core, and returns the asynchronous response. > > Essentially, we have just moved (in the case of suspend callbacks) the > call site of libxl__srm_callout_callback_complete. It was in the > switch statement in the autogenerated _libxl_save_msgs_callout.c, and > is now in the handwritten libxl_dom.c. > > No functional change. 
> > Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> > Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca> > Cc: Ian Campbell <ian.campbell@citrix.com> > --- > tools/libxl/libxl_dom.c | 25 +++++++++++++++++++------ > tools/libxl/libxl_internal.h | 2 +- > tools/libxl/libxl_save_msgs_gen.pl | 2 +- > 3 files changed, 21 insertions(+), 8 deletions(-) > > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > index 1812bdc..b5cde42 100644 > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -1003,10 +1003,8 @@ int libxl__domain_resume_device_model(libxl__gc > *gc, uint32_t domid) > return 0; > } > > -int libxl__domain_suspend_common_callback(void *user) > +int libxl__domain_suspend_callback_common(libxl__domain_suspend_state > *dss) > { > - libxl__save_helper_state *shs = user; > - libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); > STATE_AO_GC(dss->ao); > unsigned long hvm_s_state = 0, hvm_pvdrv = 0; > int ret; > @@ -1225,12 +1223,27 @@ int libxl__toolstack_save(uint32_t domid, uint8_t > **buf, > return 0; > } > > +static void libxl__domain_suspend_callback(void *data) > +{ > + libxl__save_helper_state *shs = data; > + libxl__egc *egc = shs->egc; > + libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); > + > + int ok = libxl__domain_suspend_callback_common(dss); > + libxl__xc_domain_saverestore_async_callback_done(egc, shs, ok); > +} > + > /*----- remus callbacks -----*/ > > -static int libxl__remus_domain_suspend_callback(void *data) > +static void libxl__remus_domain_suspend_callback(void *data) > { > + libxl__save_helper_state *shs = data; > + libxl__egc *egc = shs->egc; > + libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs); > + > /* REMUS TODO: Issue disk and network checkpoint reqs. 
*/ > - return libxl__domain_suspend_common_callback(data); > + int ok = libxl__domain_suspend_callback_common(dss); > + libxl__xc_domain_saverestore_async_callback_done(egc, shs, ok); > } > > static int libxl__remus_domain_resume_callback(void *data) > @@ -1354,7 +1367,7 @@ void libxl__domain_suspend(libxl__egc *egc, > libxl__domain_suspend_state *dss) > callbacks->postcopy = libxl__remus_domain_resume_callback; > callbacks->checkpoint = libxl__remus_domain_checkpoint_callback; > } else > - callbacks->suspend = libxl__domain_suspend_common_callback; > + callbacks->suspend = libxl__domain_suspend_callback; > > callbacks->switch_qemu_logdirty > libxl__domain_suspend_common_switch_qemu_logdirty; > dss->shs.callbacks.save.toolstack_save = libxl__toolstack_save; > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > index 4f92522..79eb8f8 100644 > --- a/tools/libxl/libxl_internal.h > +++ b/tools/libxl/libxl_internal.h > @@ -2551,7 +2551,7 @@ _hidden void libxl__xc_domain_save_done(libxl__egc*, > void *dss_void, > void libxl__xc_domain_saverestore_async_callback_done(libxl__egc *egc, > libxl__save_helper_state *shs, int > return_value); > > -_hidden int libxl__domain_suspend_common_callback(void *data); > +_hidden int > libxl__domain_suspend_callback_common(libxl__domain_suspend_state*); > _hidden void libxl__domain_suspend_common_switch_qemu_logdirty > (int domid, unsigned int enable, void > *data); > _hidden int libxl__toolstack_save(uint32_t domid, uint8_t **buf, > diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/ > libxl_save_msgs_gen.pl > index ee126c7..3c6bd57 100755 > --- a/tools/libxl/libxl_save_msgs_gen.pl > +++ b/tools/libxl/libxl_save_msgs_gen.pl > @@ -23,7 +23,7 @@ our @msgs = ( > STRING doing_what), > ''unsigned long'', ''done'', > ''unsigned long'', ''total''] > ], > - [ 3, ''scxW'', "suspend", [] ], > + [ 3, ''scxA'', "suspend", [] ], > [ 4, ''scxW'', "postcopy", [] ], > [ 5, ''scxA'', "checkpoint", [] ], > [ 6, ''scxA'', "switch_qemu_logdirty", [qw(int domid > -- > 1.7.10.4 > >The patch seems harmless enough. How do you want to go about this? Do you want to post/commit this patch ? Because I have to modify my patches accordingly. Or should I post this along with my patch series, avoiding the need to wait on you before I post mine ? shriram _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Ian Campbell
2013-Nov-12 09:48 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]
On Mon, 2013-11-11 at 11:56 -0600, Shriram Rajagopalan wrote:
> The patch seems harmless enough. How do you want to go about this?
> Do you want to post/commit this patch ? Because I have to modify my
> patches accordingly. Or should I post this along with my patch series,
> avoiding the need to wait on you before I post mine ?

I recommend doing both, i.e. include it at the head of your series but
also be prepared for it to go in independently, that way nobody is
blocked.

Ian.
Ian Jackson
2013-Nov-12 15:38 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]
Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]"):
> On Mon, Nov 4, 2013 at 10:40 AM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
...
> The patch seems harmless enough. How do you want to go about this?

If you think this is the right approach, you should start on it right
away based on my patch. The reason I posted it quickly was because I
wanted to unblock you.

> Do you want to post/commit this patch ? Because I have to modify my
> patches accordingly. Or should I post this along with my patch
> series, avoiding the need to wait on you before I post mine ?

You can certainly take it into the start of your patch series, yes.
Like what Roger did with the nested ao patch.

Currently there isn't any other reason to make the change in this
patch, so I don't think it should be committed right away. But if for
some reason it does get committed to staging, you or we can just drop
it from the start of your series.

Regards,
Ian.
Shriram Rajagopalan
2013-Nov-12 16:24 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]
On Tue, Nov 12, 2013 at 9:38 AM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control
> network buffering in remus callbacks [and 1 more messages] [and 1 more
> messages]"):
> > On Mon, Nov 4, 2013 at 10:40 AM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> ...
> > The patch seems harmless enough. How do you want to go about this?
>
> If you think this is the right approach, you should start on it right
> away based on my patch. The reason I posted it quickly was because I
> wanted to unblock you.

The nested-ao patch makes sense for Remus, even without fixing this
timeout issue. I can modify my stuff accordingly. Probably create a
nested-ao per iteration and drop it at the start of the next iteration.

However, the timeout part is not convincing enough. For example,
libxl__domain_suspend_common_callback [the version before your patch]
has two 6 second wait loops in the worst case..

    LOG(DEBUG, "issuing %s suspend request via XenBus control node",
        dss->hvm ? "PVHVM" : "PV");

    libxl__domain_pvcontrol_write(gc, XBT_NULL, domid, "suspend");

    LOG(DEBUG, "wait for the guest to acknowledge suspend request");
    watchdog = 60;
    while (!strcmp(state, "suspend") && watchdog > 0) {
        usleep(100000);

        state = libxl__domain_pvcontrol_read(gc, XBT_NULL, domid);
        if (!state) state = "";

        watchdog--;
    }

and then once again

    LOG(DEBUG, "wait for the guest to suspend");
    watchdog = 60;
    while (watchdog > 0) {
        xc_domaininfo_t info;

        usleep(100000);
        ret = xc_domain_getinfolist(CTX->xch, domid, 1, &info);

Now I know where the 200ms overhead per checkpoint comes from.

Shouldn't this also be made into an event loop? Irrespective of whether
it is invoked in Remus' context or normal
suspend/resume/save/restore/migrate context.

And if this remains, the Remus checkpoint interval is much lower
compared to this. Typically between 25-100ms.

> > Do you want to post/commit this patch ? Because I have to modify my
> > patches accordingly. Or should I post this along with my patch
> > series, avoiding the need to wait on you before I post mine ?
>
> You can certainly take it into the start of your patch series, yes.
> Like what Roger did with the nested ao patch.
>
> Currently there isn't any other reason to make the change in this
> patch, so I don't think it should be committed right away. But if for
> some reason it does get committed to staging, you or we can just drop
> it from the start of your series.

The only reason it might get committed to staging without other remus
patches would be to fix the issue I cited above.

cheers
shriram

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Ian Jackson
2013-Nov-12 16:38 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]
Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]"):
> The nested-ao patch makes sense for Remus, even without fixing this
> timeout issue. I can modify my stuff accordingly. Probably create a
> nested-ao per iteration and drop it at the start of the next
> iteration.

Right. Great.

> However, the timeout part is not convincing enough. For example,
> libxl__domain_suspend_common_callback [the version before your patch]
> has two 6 second wait loops in the worst case..
...
>     LOG(DEBUG, "wait for the guest to acknowledge suspend request");
>     watchdog = 60;
>     while (!strcmp(state, "suspend") && watchdog > 0) {
>         usleep(100000);
...
> and then once again
...
>         usleep(100000);

Oh dear. That is very poor.

> Now I know where the 200ms overhead per checkpoint comes from.
>
> Shouldn't this also be made into an event loop? Irrespective of
> whether it is invoked in Remus' context or normal
> suspend/resume/save/restore/migrate context.

Yes, you are entirely correct.

Both of these loops should be replaced with timeout/event/callback
approaches.

Do you want to attempt this or would you like me to do it ?

> Currently there isn't any other reason to make the change in this
> patch, so I don't think it should be committed right away. But if for
> some reason it does get committed to staging, you or we can just drop
> it from the start of your series.
...
> The only reason it might get committed to staging without other remus patches
> would be to fix the issue I cited above.

Yes.

Ian.
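As a sketch of the direction agreed here, the first polling loop could become event-driven with a xenstore watch plus a timeout registered on the event loop. The pvcontrol_watch and guest_timeout fields are hypothetical, and the libxl__ev_xswatch_*/libxl__ev_time_* signatures are recalled from memory of libxl_internal.h, so this is an assumption rather than the eventual implementation:

    /* Sketch: instead of polling every 100ms, watch the domain's control
     * node and arm a 6-second timeout; the event loop calls back whichever
     * fires first.  Registration (elsewhere) would look roughly like:
     *     libxl__ev_xswatch_register(gc, &dss->pvcontrol_watch,
     *                                suspend_ack_changed, control_path);
     *     libxl__ev_time_register_rel(gc, &dss->guest_timeout,
     *                                 suspend_ack_timeout, 6000);
     */
    static void suspend_ack_changed(libxl__egc *egc, libxl__ev_xswatch *w,
                                    const char *watch_path,
                                    const char *event_path)
    {
        libxl__domain_suspend_state *dss =
            CONTAINER_OF(w, *dss, pvcontrol_watch);
        STATE_AO_GC(dss->ao);

        const char *state =
            libxl__domain_pvcontrol_read(gc, XBT_NULL, dss->domid);
        if (!state || strcmp(state, "suspend")) {
            /* Guest acknowledged (or the node vanished): stop the watch and
             * the timeout, then move on to waiting for the actual suspend. */
            libxl__ev_xswatch_deregister(gc, &dss->pvcontrol_watch);
            libxl__ev_time_deregister(gc, &dss->guest_timeout);
            /* ... proceed to the "wait for the guest to suspend" step ... */
        }
    }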
Shriram Rajagopalan
2013-Nov-12 16:43 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]
On Tue, Nov 12, 2013 at 10:38 AM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control
> network buffering in remus callbacks [and 1 more messages] [and 1 more
> messages]"):
> > The nested-ao patch makes sense for Remus, even without fixing this
> > timeout issue. I can modify my stuff accordingly. Probably create a
> > nested-ao per iteration and drop it at the start of the next
> > iteration.
>
> Right. Great.
>
> > However, the timeout part is not convincing enough. For example,
> > libxl__domain_suspend_common_callback [the version before your patch]
> > has two 6 second wait loops in the worst case..
> ...
> > LOG(DEBUG, "wait for the guest to acknowledge suspend request");
> > watchdog = 60;
> > while (!strcmp(state, "suspend") && watchdog > 0) {
> >     usleep(100000);
> ...
> > and then once again
> ...
> >     usleep(100000);
>
> Oh dear. That is very poor.
>
> > Now I know where the 200ms overhead per checkpoint comes from.
> >
> > Shouldn't this also be made into an event loop? Irrespective of
> > whether it is invoked in Remus' context or normal
> > suspend/resume/save/restore/migrate context.
>
> Yes, you are entirely correct.
>
> Both of these loops should be replaced with timeout/event/callback
> approaches.
>
> Do you want to attempt this or would you like me to do it ?

I can take a crack at it.

Thanks
Shriram

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Ian Jackson
2013-Nov-12 17:00 UTC
Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]
Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]"):
> On Tue, Nov 12, 2013 at 10:38 AM, Ian Jackson <Ian.Jackson@eu.citrix.com>
> wrote:
> > Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control
> > network buffering in remus callbacks [and 1 more messages] [and 1 more
> > messages]"):
> ...
> > Both of these loops should be replaced with timeout/event/callback
> > approaches.
> >
> > Do you want to attempt this or would you like me to do it ?
>
> I can take a crack at it.

OK. Good luck. If you get stuck please do shout for help.

Also note that feature freeze is quite soon!

Ian.
Shriram Rajagopalan
2013-Nov-14 03:55 UTC
Re: [PATCH 2 of 5 V3] tools/hotplug: Remus network buffering setup scripts
On Thu, Oct 31, 2013 at 5:25 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > This was to just make sure that the IFB variable was supplied as part
> > of the environment.. Just like the two checks on top of this script..
> > "
> > : ${vifname?}
> > : ${XENBUS_PATH?}
> > "
>
> Do these result in useful error reporting to the end user?

It displays a message "<script name>:lineno: vifname: parameter not set
or null" on stderr.

> Bearing in mind that the end user might be using libxl but not xl and
> therefore not see stdout/err (which might be going into some daemon's
> log file).

These parameters are supposed to be set by libxl before fork+exec-ing the
script. And their values are not obtained from the user or the command
line. They are generated inside libxl, and checked for non-null status.
So I don't think we need to worry about libxl not being able to see the
error.

> Also, either I'm missing it or the bash manpage only documents
> ${parameter:?} and not ${parameter?}. Are both actually valid?

You are right. The man page documents only the ${parameter:?} format,
although the "${parameter?}" also seems to work for me. I will stick to
the documented format.

> Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel