thr3ads.net - Xen devel - [Xen-devel] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. [Feb 2011]

If this information is useful, please help other people find it:
Share via:

SUZUKI, Kazuhiro

2011-Feb-07 09:07 UTC

[Xen-devel] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Hi,

The following patch series fixes hangup after creating checkpoint on
Xen. The Linux Xen guest can be saved the state to restore later, and
also created snapshot like checkpoint via the hypervisor.
But, when the snapshot is created for the PV guest, it will hangup.

We added ''PMSG_CANCEL'' message and ''cancel''
handler in dev_pm_ops
struct in the pm-linux part.

In creating checkpoint mode, the resume handler of xenbus should not
be called. In this case, it is recognized that the suspend was canceled
in drivers/xen/manage.c and call dpm_resume_end() with PMSG_CANCEL.
If the ''cancel'' handler is defined, it is called instead of
resume().

[1/2] - Fix hangup after creating checkpoint on Xen -- pm-linux part.
[2/2] - Fix hangup after creating checkpoint on Xen -- Xen part.

Please review.

Thanks,
KAZ

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

SUZUKI, Kazuhiro

2011-Feb-07 09:08 UTC

head link

[Xen-devel] [PATCH 1/2] Fix hangup after creating checkpoint on Xen.

Hi,

This is a pm-linux part patch.

Thanks,
KAZ

Signed-off-by: Kenji Wakamiya <wkenji@jp.fujitsu.com>
Signed-off-by: Kazuhiro Suzuki <kaz@jp.fujitsu.com>
---
 drivers/base/power/main.c |    9 +++++++++
 include/linux/pm.h        |    3 +++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 8aa2443..e348b5d 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -181,6 +181,13 @@ static int pm_op(struct device *dev,
 			suspend_report_result(ops->suspend, error);
 		}
 		break;
+	case PM_EVENT_CANCEL:
+		if (ops->cancel) {
+			error = ops->cancel(dev);
+			suspend_report_result(ops->cancel, error);
+			break;
+		}
+		/* Fall through */
 	case PM_EVENT_RESUME:
 		if (ops->resume) {
 			error = ops->resume(dev);
@@ -293,6 +300,8 @@ static char *pm_verb(int event)
 		return "suspend";
 	case PM_EVENT_RESUME:
 		return "resume";
+	case PM_EVENT_CANCEL:
+		return "cancel";
 	case PM_EVENT_FREEZE:
 		return "freeze";
 	case PM_EVENT_QUIESCE:
diff --git a/include/linux/pm.h b/include/linux/pm.h
index 3b7e04b..d118781 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -198,6 +198,7 @@ struct dev_pm_ops {
 	void (*complete)(struct device *dev);
 	int (*suspend)(struct device *dev);
 	int (*resume)(struct device *dev);
+	int (*cancel)(struct device *dev);
 	int (*freeze)(struct device *dev);
 	int (*thaw)(struct device *dev);
 	int (*poweroff)(struct device *dev);
@@ -291,6 +292,7 @@ struct dev_pm_ops name = { \
 #define PM_EVENT_USER		0x0100
 #define PM_EVENT_REMOTE		0x0200
 #define PM_EVENT_AUTO		0x0400
+#define PM_EVENT_CANCEL		0x0800
 
 #define PM_EVENT_SLEEP		(PM_EVENT_SUSPEND | PM_EVENT_HIBERNATE)
 #define PM_EVENT_USER_SUSPEND	(PM_EVENT_USER | PM_EVENT_SUSPEND)
@@ -308,6 +310,7 @@ struct dev_pm_ops name = { \
 #define PMSG_THAW	((struct pm_message){ .event = PM_EVENT_THAW, })
 #define PMSG_RESTORE	((struct pm_message){ .event = PM_EVENT_RESTORE, })
 #define PMSG_RECOVER	((struct pm_message){ .event = PM_EVENT_RECOVER, })
+#define PMSG_CANCEL	((struct pm_message){ .event = PM_EVENT_CANCEL, })
 #define PMSG_USER_SUSPEND	((struct pm_message) \
 					{ .event = PM_EVENT_USER_SUSPEND, })
 #define PMSG_USER_RESUME	((struct pm_message) \
-- 
1.7.3.1


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

SUZUKI, Kazuhiro

2011-Feb-07 09:08 UTC

head link

[Xen-devel] [PATCH 2/2] Fix hangup after creating checkpoint on Xen.

Hi,

This is a Xen part patch.

Thanks,
KAZ

Signed-off-by: Kenji Wakamiya <wkenji@jp.fujitsu.com>
Signed-off-by: Kazuhiro Suzuki <kaz@jp.fujitsu.com>
---
 drivers/net/xen-netfront.c                 |    2 +-
 drivers/xen/manage.c                       |    7 ++++---
 drivers/xen/xenbus/xenbus_probe.c          |   12 ++++++++++--
 drivers/xen/xenbus/xenbus_probe.h          |    3 ++-
 drivers/xen/xenbus/xenbus_probe_frontend.c |    8 ++++++--
 include/xen/xenbus.h                       |    2 +-
 6 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 3f71199..22c6288 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1293,7 +1293,7 @@ static void xennet_disconnect_backend(struct netfront_info
*info)
 	info->rx.sring = NULL;
 }
 
-static int netfront_suspend(struct xenbus_device *dev, pm_message_t state)
+static int netfront_suspend(struct xenbus_device *dev)
 {
 	struct netfront_info *info = dev_get_drvdata(&dev->dev);
 	struct hrtimer *timer = &info->smart_poll.timer;
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 0b50906..845afb8 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -148,10 +148,11 @@ out_resume:
 	if (!cancelled) {
 		xen_arch_resume();
 		xs_resume();
-	} else
+		dpm_resume_end(PMSG_RESUME);
+	} else {
 		xs_suspend_cancel();
-
-	dpm_resume_end(PMSG_RESUME);
+		dpm_resume_end(PMSG_CANCEL);
+	}
 
 	/* Make sure timer events get retriggered on all CPUs */
 	clock_was_set();
diff --git a/drivers/xen/xenbus/xenbus_probe.c
b/drivers/xen/xenbus/xenbus_probe.c
index 3a83ba2..02dbb8b 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -575,7 +575,7 @@ void xenbus_dev_changed(const char *node, struct
xen_bus_type *bus)
 }
 EXPORT_SYMBOL_GPL(xenbus_dev_changed);
 
-int xenbus_dev_suspend(struct device *dev, pm_message_t state)
+int xenbus_dev_suspend(struct device *dev)
 {
 	int err = 0;
 	struct xenbus_driver *drv;
@@ -587,7 +587,7 @@ int xenbus_dev_suspend(struct device *dev, pm_message_t
state)
 		return 0;
 	drv = to_xenbus_driver(dev->driver);
 	if (drv->suspend)
-		err = drv->suspend(xdev, state);
+		err = drv->suspend(xdev);
 	if (err)
 		printk(KERN_WARNING
 		       "xenbus: suspend %s failed: %i\n", dev_name(dev), err);
@@ -638,6 +638,14 @@ int xenbus_dev_resume(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(xenbus_dev_resume);
 
+int xenbus_dev_cancel(struct device *dev)
+{
+	/* Do nothing */
+	DPRINTK("cancel");
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xenbus_dev_cancel);
+
 /* A flag to determine if xenstored is ''ready'' (i.e. has
started) */
 int xenstored_ready = 0;
 
diff --git a/drivers/xen/xenbus/xenbus_probe.h
b/drivers/xen/xenbus/xenbus_probe.h
index 0e5fc4c..4019187 100644
--- a/drivers/xen/xenbus/xenbus_probe.h
+++ b/drivers/xen/xenbus/xenbus_probe.h
@@ -62,8 +62,9 @@ extern void xenbus_dev_changed(const char *node, struct
xen_bus_type *bus);
 
 extern void xenbus_dev_shutdown(struct device *_dev);
 
-extern int xenbus_dev_suspend(struct device *dev, pm_message_t state);
+extern int xenbus_dev_suspend(struct device *dev);
 extern int xenbus_dev_resume(struct device *dev);
+extern int xenbus_dev_cancel(struct device *dev);
 
 extern void xenbus_otherend_changed(struct xenbus_watch *watch,
 				    const char **vec, unsigned int len,
diff --git a/drivers/xen/xenbus/xenbus_probe_frontend.c
b/drivers/xen/xenbus/xenbus_probe_frontend.c
index 5413248..6dfc588 100644
--- a/drivers/xen/xenbus/xenbus_probe_frontend.c
+++ b/drivers/xen/xenbus/xenbus_probe_frontend.c
@@ -82,6 +82,11 @@ static struct device_attribute xenbus_frontend_dev_attrs[] =
{
 	__ATTR_NULL
 };
 
+static struct dev_pm_ops xenbus_pm_ops = {
+	.suspend = xenbus_dev_suspend,
+	.resume  = xenbus_dev_resume,
+	.cancel  = xenbus_dev_cancel,
+};
 
 static struct xen_bus_type xenbus_frontend = {
 	.root = "device",
@@ -98,8 +103,7 @@ static struct xen_bus_type xenbus_frontend = {
 		.shutdown = xenbus_dev_shutdown,
 		.dev_attrs= xenbus_frontend_dev_attrs,
 
-		.suspend  = xenbus_dev_suspend,
-		.resume   = xenbus_dev_resume,
+		.pm       = &xenbus_pm_ops,
 	},
 };
 
diff --git a/include/xen/xenbus.h b/include/xen/xenbus.h
index 542ca7c..23e7f25 100644
--- a/include/xen/xenbus.h
+++ b/include/xen/xenbus.h
@@ -91,7 +91,7 @@ struct xenbus_driver {
 	void (*otherend_changed)(struct xenbus_device *dev,
 				 enum xenbus_state backend_state);
 	int (*remove)(struct xenbus_device *dev);
-	int (*suspend)(struct xenbus_device *dev, pm_message_t state);
+	int (*suspend)(struct xenbus_device *dev);
 	int (*resume)(struct xenbus_device *dev);
 	int (*uevent)(struct xenbus_device *, struct kobj_uevent_env *);
 	struct device_driver driver;
-- 
1.7.3.1


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Rafael J. Wysocki

2011-Feb-07 09:35 UTC

head link

[Xen-devel] Re: [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Monday, February 07, 2011, SUZUKI, Kazuhiro wrote:> Hi,
Hi,
> The following patch series fixes hangup after creating checkpoint on
> Xen. The Linux Xen guest can be saved the state to restore later, and
> also created snapshot like checkpoint via the hypervisor.
> But, when the snapshot is created for the PV guest, it will hangup.
> 
> We added ''PMSG_CANCEL'' message and
''cancel'' handler in dev_pm_ops
> struct in the pm-linux part.
Please don''t do that, unless you can convince me there''s no
other way to fix
the problem you''re trying to address.

In my opinion it''s highly unrealistic to assume that device drivers
(or even subsystems) will implement the ->cancel() callback just for the
benefit of Xen.

And if the only subsystem that needs to implement ->cancel() is Xen, then the
issue should be addressed without modifying the device core code, in a different
way.
> In creating checkpoint mode, the resume handler of xenbus should not
> be called. In this case, it is recognized that the suspend was canceled
> in drivers/xen/manage.c and call dpm_resume_end() with PMSG_CANCEL.
> If the ''cancel'' handler is defined, it is called instead
of resume().
> 
> [1/2] - Fix hangup after creating checkpoint on Xen -- pm-linux part.
> [2/2] - Fix hangup after creating checkpoint on Xen -- Xen part.
Thanks,
Rafael

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Campbell

2011-Feb-08 11:22 UTC

head link

[Xen-devel] Re: [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Mon, 2011-02-07 at 10:35 +0100, Rafael J. Wysocki
wrote:> On Monday, February 07, 2011, SUZUKI, Kazuhiro wrote:
> > Hi,
> 
> Hi,
> 
> > The following patch series fixes hangup after creating checkpoint on
> > Xen. The Linux Xen guest can be saved the state to restore later, and
> > also created snapshot like checkpoint via the hypervisor.
> > But, when the snapshot is created for the PV guest, it will hangup.
> > 
> > We added ''PMSG_CANCEL'' message and
''cancel'' handler in dev_pm_ops
> > struct in the pm-linux part.
> 
> Please don''t do that, unless you can convince me there''s
no other way to fix
> the problem you''re trying to address.
Sorry, it was my advise to Kazuhiro that solving the underlying issue by
extending the core would be preferable to making Xen specific hacks.

The problem is that currently we have:

        dpm_suspend_start(PMSG_SUSPEND);
        
                dpm_suspend_noirq(PMSG_SUSPEND);
                        
                        sysdev_suspend(PMSG_SUSPEND);
                        /* suspend hypercall */
                        sysdev_resume();
                
                dpm_resume_noirq(PMSG_RESUME);
        
        dpm_resume_end(PMSG_RESUME);

However the suspend hypercall can return a value indicating that the
suspend didn''t actually happen (e.g. was cancelled). This is used e.g.
when checkpointing guests, because in that case you want the original
guest to continue. When the suspend didn''t happen the drivers need to
recover differently from if it did.

The originally proposed solution was to only call dpm_resume_end if the
suspend was not cancelled. My concern with this was that unbalancing the
dpm_suspend_* and dpm_resume_* did not seem like a correct interaction
with the core. For example dpm_suspend_* adds stuff to
dpm_suspended_list and if dpm_resume_* is not called presumably this all
gets out of sync for next time. Hence my suggestion to add a cancel
message type.
> In my opinion it''s highly unrealistic to assume that device
drivers
> (or even subsystems) will implement the ->cancel() callback just for the
> benefit of Xen.
I did vague wonder if a similar message might be of interest to e.g.
cancelling hibernations or similar.
> And if the only subsystem that needs to implement ->cancel() is Xen,
then the
> issue should be addressed without modifying the device core code, in a
different
> way.
I thought it would be preferable to make use of/extend core
functionality where possible but if that''s not the case we can find
another way.

Do you have any suggestions for how to correctly interact with the core
functions?

Is adding a suspend_cancel operation to just at the struct xenbus_driver
level and introducing a xen specific function to walk to the bus the
sort of thing you were thinking of? (it seems reasonable). Should we be
doing anything with dpm_*_list in that case?

(FWIW original thread is on xen-devel at
http://thread.gmane.org/gmane.comp.emulators.xen.devel/95265/)

Ian.

-- 
Ian Campbell
Current Noise: Crowbar - Dead Sun

Why on earth do people buy old bottles of wine when they can get a
fresh one for a quarter of the price?


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Alan Stern

2011-Feb-08 16:46 UTC

head link

[Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Tue, 8 Feb 2011, Ian Campbell wrote:
> The problem is that currently we have:
> 
>         dpm_suspend_start(PMSG_SUSPEND);
>         
>                 dpm_suspend_noirq(PMSG_SUSPEND);
>                         
>                         sysdev_suspend(PMSG_SUSPEND);
>                         /* suspend hypercall */
>                         sysdev_resume();
>                 
>                 dpm_resume_noirq(PMSG_RESUME);
>         
>         dpm_resume_end(PMSG_RESUME);
> 
> However the suspend hypercall can return a value indicating that the
> suspend didn''t actually happen (e.g. was cancelled). This is used
e.g.
> when checkpointing guests, because in that case you want the original
> guest to continue. When the suspend didn''t happen the drivers need
to
> recover differently from if it did.
That is odd, and it is quite different from the intended design of the 
PM core.  Drivers are supposed to put their devices into a known 
suspended state; then afterwards they put the devices back into an 
operational state.  What happens while the devices are in the suspended 
state isn''t supposed to matter -- the system transition can fail, but 
devices get treated exactly the same way as if it succeeded.

Why do your drivers need to recover differently based on the success of 
the hypercall?

Alan Stern


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Campbell

2011-Feb-08 17:35 UTC

head link

[Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Tue, 2011-02-08 at 11:46 -0500, Alan Stern wrote:> On Tue, 8 Feb 2011, Ian Campbell wrote:
> 
> > The problem is that currently we have:
> > 
> >         dpm_suspend_start(PMSG_SUSPEND);
> >         
> >                 dpm_suspend_noirq(PMSG_SUSPEND);
> >                         
> >                         sysdev_suspend(PMSG_SUSPEND);
> >                         /* suspend hypercall */
> >                         sysdev_resume();
> >                 
> >                 dpm_resume_noirq(PMSG_RESUME);
> >         
> >         dpm_resume_end(PMSG_RESUME);
> > 
> > However the suspend hypercall can return a value indicating that the
> > suspend didn''t actually happen (e.g. was cancelled). This is
used e.g.
> > when checkpointing guests, because in that case you want the original
> > guest to continue. When the suspend didn''t happen the drivers
need to
> > recover differently from if it did.
> 
> That is odd, and it is quite different from the intended design of the 
> PM core.  Drivers are supposed to put their devices into a known 
> suspended state; then afterwards they put the devices back into an 
> operational state.  What happens while the devices are in the suspended 
> state isn''t supposed to matter -- the system transition can fail,
but
> devices get treated exactly the same way as if it succeeded.
> 
> Why do your drivers need to recover differently based on the success of 
> the hypercall?
checkpointing isn''t really my area but AIUI you don''t want to
do a full
device teardown and reconnect like you would with a proper suspend
because of the time that takes which prevents you from doing continuous
rolling checkpoints at granularity which people want to implement
various disaster recovery schemes.

Hopefully one of the Xen checkpointing folks will chime in and explain
why this is not possible to achieve at the individual driver level (or,
even better, with a patch which does it that way ;-)).

Ian.

-- 
Ian Campbell
Current Noise: Buckcherry - King Of Kings

Beauty?  What''s that?
		-- Larry Wall in <199710221937.MAA25131@wall.org>


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Brendan Cully

2011-Feb-09 23:16 UTC

head link

[Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Tuesday, 08 February 2011 at 17:35, Ian Campbell
wrote:> On Tue, 2011-02-08 at 11:46 -0500, Alan Stern wrote:
> > On Tue, 8 Feb 2011, Ian Campbell wrote:
> > 
> > > The problem is that currently we have:
> > > 
> > >         dpm_suspend_start(PMSG_SUSPEND);
> > >         
> > >                 dpm_suspend_noirq(PMSG_SUSPEND);
> > >                         
> > >                         sysdev_suspend(PMSG_SUSPEND);
> > >                         /* suspend hypercall */
> > >                         sysdev_resume();
> > >                 
> > >                 dpm_resume_noirq(PMSG_RESUME);
> > >         
> > >         dpm_resume_end(PMSG_RESUME);
> > > 
> > > However the suspend hypercall can return a value indicating that
the
> > > suspend didn''t actually happen (e.g. was cancelled).
This is used e.g.
> > > when checkpointing guests, because in that case you want the
original
> > > guest to continue. When the suspend didn''t happen the
drivers need to
> > > recover differently from if it did.
> > 
> > That is odd, and it is quite different from the intended design of the
> > PM core.  Drivers are supposed to put their devices into a known 
> > suspended state; then afterwards they put the devices back into an 
> > operational state.  What happens while the devices are in the
suspended
> > state isn''t supposed to matter -- the system transition can
fail, but
> > devices get treated exactly the same way as if it succeeded.
> > 
> > Why do your drivers need to recover differently based on the success
of
> > the hypercall?
> 
> checkpointing isn''t really my area but AIUI you don''t
want to do a full
> device teardown and reconnect like you would with a proper suspend
> because of the time that takes which prevents you from doing continuous
> rolling checkpoints at granularity which people want to implement
> various disaster recovery schemes.
> 
> Hopefully one of the Xen checkpointing folks will chime in and explain
> why this is not possible to achieve at the individual driver level (or,
> even better, with a patch which does it that way ;-)).
As Ian says, Xen has suspend_cancel because while the normal full
suspend/resume path works, it is much slower, and the work done during
resume is redundant. I don''t remember the numbers offhand, but when we
added suspend_cancel I think we could do full suspend/resume in under
100us, which was maybe a couple of orders of magnitude faster than
full resume (largely due to slow xenstore handshaking on resume,
IIRC). It made a big difference for our Remus HA project, which was
checkpointing tens of times per second.

I''d like to keep the fast resume option, and expect that it can be
contained entirely in Xen-specific code. I''ll try to get someone to
look into it here.

I think fast resume is somewhat orthogonal to the problem of hanging
on resume, which just sounds like a xen-specific bug in the slow
path.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Alan Stern

2011-Feb-09 23:42 UTC

head link

[Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Wed, 9 Feb 2011, Brendan Cully wrote:
> > > That is odd, and it is quite different from the intended design
of the
> > > PM core.  Drivers are supposed to put their devices into a known 
> > > suspended state; then afterwards they put the devices back into
an
> > > operational state.  What happens while the devices are in the
suspended
> > > state isn''t supposed to matter -- the system transition
can fail, but
> > > devices get treated exactly the same way as if it succeeded.
> > > 
> > > Why do your drivers need to recover differently based on the
success of
> > > the hypercall?
> > 
> > checkpointing isn''t really my area but AIUI you
don''t want to do a full
> > device teardown and reconnect like you would with a proper suspend
> > because of the time that takes which prevents you from doing
continuous
> > rolling checkpoints at granularity which people want to implement
> > various disaster recovery schemes.
> > 
> > Hopefully one of the Xen checkpointing folks will chime in and explain
> > why this is not possible to achieve at the individual driver level
(or,
> > even better, with a patch which does it that way ;-)).
> 
> As Ian says, Xen has suspend_cancel because while the normal full
> suspend/resume path works, it is much slower, and the work done during
> resume is redundant. I don''t remember the numbers offhand, but
when we
> added suspend_cancel I think we could do full suspend/resume in under
> 100us, which was maybe a couple of orders of magnitude faster than
> full resume (largely due to slow xenstore handshaking on resume,
> IIRC). It made a big difference for our Remus HA project, which was
> checkpointing tens of times per second.
> 
> I''d like to keep the fast resume option, and expect that it can be
> contained entirely in Xen-specific code. I''ll try to get someone
to
> look into it here.
> 
> I think fast resume is somewhat orthogonal to the problem of hanging
> on resume, which just sounds like a xen-specific bug in the slow
> path.
In fact there already is a "fast suspend & resume" path in the PM
core.
It''s the freeze/thaw procedure used when starting to hibernate.  The
documentation specifically says that drivers'' freeze methods are
supposed to quiesce their devices but not change power levels.  In
addition, the thaw method is invoked as part of recovery from a failed
hibernation attempt, so it already has the "cancel" semantics that xen
seems to want.

Alan Stern


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Campbell

2011-Feb-10 11:31 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Wed, 2011-02-09 at 23:16 +0000, Brendan Cully wrote:> I''d like to keep the fast resume option, and expect that it can be
> contained entirely in Xen-specific code. I''ll try to get someone
to
> look into it here.
Thanks, please put them in contact with Kazuhiro who has already been
looking into this.
> I think fast resume is somewhat orthogonal to the problem of hanging
> on resume, which just sounds like a xen-specific bug in the slow
> path.
The bug on the Xen side is taking the slow path when the suspend was
actually cancelled, which is what the original patch tries to fix.

Or should/could the slow path be able cope either way?

Ian.

-- 
Ian Campbell
Current Noise: Neil Young - Love And War

For every complex problem, there is a solution that is simple, neat, and wrong.
		-- H. L. Mencken


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Campbell

2011-Feb-10 11:40 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Wed, 2011-02-09 at 23:42 +0000, Alan Stern wrote:> In fact there already is a "fast suspend & resume" path in
the PM core.
> It''s the freeze/thaw procedure used when starting to hibernate. 
The
> documentation specifically says that drivers'' freeze methods are
> supposed to quiesce their devices but not change power levels.  In
> addition, the thaw method is invoked as part of recovery from a failed
> hibernation attempt, so it already has the "cancel" semantics
that xen
> seems to want.
Sounds like that would work and I would much prefer to simply make
correct use of the core functionality.

So PMSG_FREEZE is balanced by either PMSG_RECOVER or PMSG_THAW depending
on whether the suspend was cancelled or not? So the sequence of events
is something like:
	dpm_suspend_start(PMSG_FREEZE);
         
		dpm_suspend_noirq(PMSG_FREEZE);
                         
			sysdev_suspend(PMSG_QUIESCE);
			cancelled = suspend_hypercall()
			sysdev_resume();
                 
		dpm_resume_noirq(cancelled ? PMSG_RECOVER : PMSG_THAW);
         
	dpm_resume_end(cancelled ? PMSG_RECOVER : PMSG_THAW);
?

(For comparison we currently have:> > >         dpm_suspend_start(PMSG_SUSPEND);
> > >         
> > >                 dpm_suspend_noirq(PMSG_SUSPEND);
> > >                         
> > >                         sysdev_suspend(PMSG_SUSPEND);
> > >                         /* suspend hypercall */
> > >                         sysdev_resume();
> > >                 
> > >                 dpm_resume_noirq(PMSG_RESUME);
> > >         
> > >         dpm_resume_end(PMSG_RESUME);)

Ian.
-- 
Ian Campbell
Current Noise: Neil Young - Angry World

If only one could get that wonderful feeling of accomplishment without
having to accomplish anything.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Campbell

2011-Feb-10 12:40 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

(cutting CC to just xen-devel)

On Thu, 2011-02-10 at 11:31 +0000, Ian Campbell wrote:> On Wed, 2011-02-09 at 23:16 +0000, Brendan Cully wrote:
> > I''d like to keep the fast resume option, and expect that it
can be
> > contained entirely in Xen-specific code. I''ll try to get
someone to
> > look into it here.
On that note: Is there someone around who is willing to more actively
and visibly maintain Remus? i.e. make it work with the new toolstack,
upstream it to the mainline kernel, work to get it into a state where it
can be tested as a matter of course, and generally be seen to be
improving it (and its integration/interaction with the rest of the
ecosystem) with time?

At the moment the impression is that Remus is mostly being left to rot
apart from when noise gets made about specific issues.

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Alan Stern

2011-Feb-10 16:00 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Thu, 10 Feb 2011, Ian Campbell wrote:
> On Wed, 2011-02-09 at 23:42 +0000, Alan Stern wrote:
> > In fact there already is a "fast suspend & resume" path
in the PM core.
> > It''s the freeze/thaw procedure used when starting to
hibernate.  The
> > documentation specifically says that drivers'' freeze methods
are
> > supposed to quiesce their devices but not change power levels.  In
> > addition, the thaw method is invoked as part of recovery from a failed
> > hibernation attempt, so it already has the "cancel"
semantics that xen
> > seems to want.
> 
> Sounds like that would work and I would much prefer to simply make
> correct use of the core functionality.
It seems like a reasonable approach.  Whether it will actually _work_ 
is a harder question...  :-)
> So PMSG_FREEZE is balanced by either PMSG_RECOVER or PMSG_THAW depending
> on whether the suspend was cancelled or not?
Basically yes.  It is also "balanced" by PMSG_RESTORE, which is used
after a memory image has been restored (although this isn''t relevant to
your snapshotting).  See the comments in include/linux/pm.h.
>  So the sequence of events
> is something like:
> 	dpm_suspend_start(PMSG_FREEZE);
>          
> 		dpm_suspend_noirq(PMSG_FREEZE);
>                          
> 			sysdev_suspend(PMSG_QUIESCE);
This should say sysdev_suspend(PMSG_FREEZE).
> 			cancelled = suspend_hypercall()
At this point swsusp_arch_suspend() is called.  If that translates to 
suspend_hypercall() in your setting, then yes.
> 			sysdev_resume();
>                  
> 		dpm_resume_noirq(cancelled ? PMSG_RECOVER : PMSG_THAW);
>          
> 	dpm_resume_end(cancelled ? PMSG_RECOVER : PMSG_THAW);
> ?
Yes.
> (For comparison we currently have:
> > > >         dpm_suspend_start(PMSG_SUSPEND);
> > > >         
> > > >                 dpm_suspend_noirq(PMSG_SUSPEND);
> > > >                         
> > > >                         sysdev_suspend(PMSG_SUSPEND);
> > > >                         /* suspend hypercall */
> > > >                         sysdev_resume();
> > > >                 
> > > >                 dpm_resume_noirq(PMSG_RESUME);
> > > >         
> > > >         dpm_resume_end(PMSG_RESUME);
> )
Right.  The sequence of calls is the same, but the PMSG_ argument is 
different so drivers are expected to act differently in response.

Alan Stern


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Rafael J. Wysocki

2011-Feb-10 16:26 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Thursday, February 10, 2011, Alan Stern wrote:> On Thu, 10 Feb 2011, Ian Campbell wrote:
> 
> > On Wed, 2011-02-09 at 23:42 +0000, Alan Stern wrote:
> > > In fact there already is a "fast suspend & resume"
path in the PM core.
> > > It''s the freeze/thaw procedure used when starting to
hibernate.  The
> > > documentation specifically says that drivers'' freeze
methods are
> > > supposed to quiesce their devices but not change power levels. 
In
> > > addition, the thaw method is invoked as part of recovery from a
failed
> > > hibernation attempt, so it already has the "cancel"
semantics that xen
> > > seems to want.
> > 
> > Sounds like that would work and I would much prefer to simply make
> > correct use of the core functionality.
> 
> It seems like a reasonable approach.  Whether it will actually _work_ 
> is a harder question...  :-)
> 
> > So PMSG_FREEZE is balanced by either PMSG_RECOVER or PMSG_THAW
depending
> > on whether the suspend was cancelled or not?
That''s correct, but from drivers'' point of view PMSG_THAW is
equivalent to
PMSG_RECOVER, because the both of them cause ->thaw() callbacks to be
executed.
> Basically yes.  It is also "balanced" by PMSG_RESTORE, which is
used
> after a memory image has been restored (although this isn''t
relevant to
> your snapshotting).  See the comments in include/linux/pm.h.
Yup.
> >  So the sequence of events
> > is something like:
> > 	dpm_suspend_start(PMSG_FREEZE);
> >          
> > 		dpm_suspend_noirq(PMSG_FREEZE);
> >                          
> > 			sysdev_suspend(PMSG_QUIESCE);
> 
> This should say sysdev_suspend(PMSG_FREEZE).
Yes, PMSG_QUIESCE is restore-specific.
> > 			cancelled = suspend_hypercall()
> 
> At this point swsusp_arch_suspend() is called.  If that translates to 
> suspend_hypercall() in your setting, then yes.
> 
> > 			sysdev_resume();
> >                  
> > 		dpm_resume_noirq(cancelled ? PMSG_RECOVER : PMSG_THAW);
> >          
> > 	dpm_resume_end(cancelled ? PMSG_RECOVER : PMSG_THAW);
> > ?
> 
> Yes.
Actually, I think PMSG_THAW can be used in both cases.  The resume-side
routines only use the ''state'' argument for diagnostics.
> > (For comparison we currently have:
> > > > >         dpm_suspend_start(PMSG_SUSPEND);
> > > > >         
> > > > >                 dpm_suspend_noirq(PMSG_SUSPEND);
> > > > >                         
> > > > >                         sysdev_suspend(PMSG_SUSPEND);
> > > > >                         /* suspend hypercall */
> > > > >                         sysdev_resume();
> > > > >                 
> > > > >                 dpm_resume_noirq(PMSG_RESUME);
> > > > >         
> > > > >         dpm_resume_end(PMSG_RESUME);
> > )
> 
> Right.  The sequence of calls is the same, but the PMSG_ argument is 
> different so drivers are expected to act differently in response.
That''s correct.

Thanks,
Rafael

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Campbell

2011-Feb-10 16:34 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Thu, 2011-02-10 at 11:00 -0500, Alan Stern wrote:> On Thu, 10 Feb 2011, Ian Campbell wrote:
> 
> > On Wed, 2011-02-09 at 23:42 +0000, Alan Stern wrote:
> > > In fact there already is a "fast suspend & resume"
path in the PM core.
> > > It''s the freeze/thaw procedure used when starting to
hibernate.  The
> > > documentation specifically says that drivers'' freeze
methods are
> > > supposed to quiesce their devices but not change power levels. 
In
> > > addition, the thaw method is invoked as part of recovery from a
failed
> > > hibernation attempt, so it already has the "cancel"
semantics that xen
> > > seems to want.
> > 
> > Sounds like that would work and I would much prefer to simply make
> > correct use of the core functionality.
> 
> It seems like a reasonable approach.  Whether it will actually _work_ 
> is a harder question...  :-)
Heh.
> > So PMSG_FREEZE is balanced by either PMSG_RECOVER or PMSG_THAW
depending
> > on whether the suspend was cancelled or not?
> 
> Basically yes.  It is also "balanced" by PMSG_RESTORE, which is
used
> after a memory image has been restored (although this isn''t
relevant to
> your snapshotting).  See the comments in include/linux/pm.h.
The documentation of the individual events in pm.h is good. Is there a
reference for the sequence of events for the different types of
suspend/hibernate/etc?
> >  So the sequence of events
> > is something like:
> > 	dpm_suspend_start(PMSG_FREEZE);
> >          
> > 		dpm_suspend_noirq(PMSG_FREEZE);
> >                          
> > 			sysdev_suspend(PMSG_QUIESCE);
> 
> This should say sysdev_suspend(PMSG_FREEZE).
> 
> > 			cancelled = suspend_hypercall()
> 
> At this point swsusp_arch_suspend() is called.  If that translates to 
> suspend_hypercall() in your setting, then yes.
> 
> > 			sysdev_resume();
> >                  
> > 		dpm_resume_noirq(cancelled ? PMSG_RECOVER : PMSG_THAW);
> >          
> > 	dpm_resume_end(cancelled ? PMSG_RECOVER : PMSG_THAW);
> > ?
> 
> Yes.
Both of those call ->thaw ->complete. Did I mean "cancelled ?
PMSG_THAW : PMSG_RESTORE"? (or s/THAW/RECOVER?)

If the suspend was cancelled then we want the devices to simply pickup
where they were before the freeze, wereas if we really did suspend (or
migrate or whatever) then they need to do a more complete reset and
reconnect operation so we want some sort of indication to the driver
which happened.
> > (For comparison we currently have:
> > > > >         dpm_suspend_start(PMSG_SUSPEND);
> > > > >         
> > > > >                 dpm_suspend_noirq(PMSG_SUSPEND);
> > > > >                         
> > > > >                         sysdev_suspend(PMSG_SUSPEND);
> > > > >                         /* suspend hypercall */
> > > > >                         sysdev_resume();
> > > > >                 
> > > > >                 dpm_resume_noirq(PMSG_RESUME);
> > > > >         
> > > > >         dpm_resume_end(PMSG_RESUME);
> > )
> 
> Right.  The sequence of calls is the same, but the PMSG_ argument is 
> different so drivers are expected to act differently in response.
The drivers don''t actually see the PMSG_* though right? They only see a
differing sequence of hooks from dev_pm_ops called.

Thanks,
Ian.
> 
> Alan Stern
> 
> 
-- 
Ian Campbell
Current Noise: Sworn Amongst - Useless

Accident, n.:
	A condition in which presence of mind is good, but absence of
	body is better.
		-- Foolish Dictionary


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Rafael J. Wysocki

2011-Feb-10 17:01 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Thursday, February 10, 2011, Ian Campbell wrote:> On Thu, 2011-02-10 at 11:00 -0500, Alan Stern wrote:
> > On Thu, 10 Feb 2011, Ian Campbell wrote:
> > 
> > > On Wed, 2011-02-09 at 23:42 +0000, Alan Stern wrote:
> > > > In fact there already is a "fast suspend &
resume" path in the PM core.
> > > > It''s the freeze/thaw procedure used when starting
to hibernate.  The
> > > > documentation specifically says that drivers''
freeze methods are
> > > > supposed to quiesce their devices but not change power
levels.  In
> > > > addition, the thaw method is invoked as part of recovery
from a failed
> > > > hibernation attempt, so it already has the
"cancel" semantics that xen
> > > > seems to want.
> > > 
> > > Sounds like that would work and I would much prefer to simply
make
> > > correct use of the core functionality.
> > 
> > It seems like a reasonable approach.  Whether it will actually _work_ 
> > is a harder question...  :-)
> 
> Heh.
> 
> > > So PMSG_FREEZE is balanced by either PMSG_RECOVER or PMSG_THAW
depending
> > > on whether the suspend was cancelled or not?
> > 
> > Basically yes.  It is also "balanced" by PMSG_RESTORE, which
is used
> > after a memory image has been restored (although this isn''t
relevant to
> > your snapshotting).  See the comments in include/linux/pm.h.
> 
> The documentation of the individual events in pm.h is good. Is there a
> reference for the sequence of events for the different types of
> suspend/hibernate/etc?
> 
> > >  So the sequence of events
> > > is something like:
> > > 	dpm_suspend_start(PMSG_FREEZE);
> > >          
> > > 		dpm_suspend_noirq(PMSG_FREEZE);
> > >                          
> > > 			sysdev_suspend(PMSG_QUIESCE);
> > 
> > This should say sysdev_suspend(PMSG_FREEZE).
> > 
> > > 			cancelled = suspend_hypercall()
> > 
> > At this point swsusp_arch_suspend() is called.  If that translates to 
> > suspend_hypercall() in your setting, then yes.
> > 
> > > 			sysdev_resume();
> > >                  
> > > 		dpm_resume_noirq(cancelled ? PMSG_RECOVER : PMSG_THAW);
> > >          
> > > 	dpm_resume_end(cancelled ? PMSG_RECOVER : PMSG_THAW);
> > > ?
> > 
> > Yes.
> 
> Both of those call ->thaw ->complete. Did I mean "cancelled ?
> PMSG_THAW : PMSG_RESTORE"? (or s/THAW/RECOVER?)
> 
> If the suspend was cancelled then we want the devices to simply pickup
> where they were before the freeze, wereas if we really did suspend (or
> migrate or whatever) then they need to do a more complete reset and
> reconnect operation so we want some sort of indication to the driver
> which happened.
In that case you should probably use PMSG_THAW (or PMSG_RECOVER) for the
"cancel" case and PMSG_RESTORE for the "success" case
(pretty much what
hibernation does).

And please don''t forget to update the comments in pm.h to cover your
usage
case. :-)
> > > (For comparison we currently have:
> > > > > >         dpm_suspend_start(PMSG_SUSPEND);
> > > > > >         
> > > > > >                 dpm_suspend_noirq(PMSG_SUSPEND);
> > > > > >                         
> > > > > >                        
sysdev_suspend(PMSG_SUSPEND);
> > > > > >                         /* suspend hypercall */
> > > > > >                         sysdev_resume();
> > > > > >                 
> > > > > >                 dpm_resume_noirq(PMSG_RESUME);
> > > > > >         
> > > > > >         dpm_resume_end(PMSG_RESUME);
> > > )
> > 
> > Right.  The sequence of calls is the same, but the PMSG_ argument is 
> > different so drivers are expected to act differently in response.
> 
> The drivers don''t actually see the PMSG_* though right? They only
see a
> differing sequence of hooks from dev_pm_ops called.
That''s correct.

Thanks,
Rafael

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Brendan Cully

2011-Feb-10 17:53 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Thursday, 10 February 2011 at 11:31, Ian Campbell
wrote:> On Wed, 2011-02-09 at 23:16 +0000, Brendan Cully wrote:
> > I''d like to keep the fast resume option, and expect that it
can be
> > contained entirely in Xen-specific code. I''ll try to get
someone to
> > look into it here.
> 
> Thanks, please put them in contact with Kazuhiro who has already been
> looking into this.
> 
> > I think fast resume is somewhat orthogonal to the problem of hanging
> > on resume, which just sounds like a xen-specific bug in the slow
> > path.
> 
> The bug on the Xen side is taking the slow path when the suspend was
> actually cancelled, which is what the original patch tries to fix.
> 
> Or should/could the slow path be able cope either way?
I haven''t looked at what''s actually happening yet, but what
should be
happening is that the pvops kernel should not advertise SUSPEND_CANCEL
since it doesn''t yet support it. So xm save -c should call
xc_domain_resume with the fast argument set to 0, to do slow path
resumption, and that slow path should be able to handle resuming on
the same host where it suspended. Slow path resumption from checkpoint
was supported in the pre-pvops kernel.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Alan Stern

2011-Feb-10 18:56 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Thu, 10 Feb 2011, Ian Campbell wrote:
> The documentation of the individual events in pm.h is good. Is there a
> reference for the sequence of events for the different types of
> suspend/hibernate/etc?
There''s Documentation/power/devices.txt.

Alan Stern


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Brendan Cully

2011-Feb-10 19:31 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Thursday, 10 February 2011 at 12:40, Ian Campbell
wrote:> (cutting CC to just xen-devel)
> 
> On Thu, 2011-02-10 at 11:31 +0000, Ian Campbell wrote:
> > On Wed, 2011-02-09 at 23:16 +0000, Brendan Cully wrote:
> > > I''d like to keep the fast resume option, and expect that
it can be
> > > contained entirely in Xen-specific code. I''ll try to get
someone to
> > > look into it here.
> 
> On that note: Is there someone around who is willing to more actively
> and visibly maintain Remus? i.e. make it work with the new toolstack,
> upstream it to the mainline kernel, work to get it into a state where it
> can be tested as a matter of course, and generally be seen to be
> improving it (and its integration/interaction with the rest of the
> ecosystem) with time?
> 
> At the moment the impression is that Remus is mostly being left to rot
> apart from when noise gets made about specific issues.
I''d like to introduce Shriram Rajagopalan, another PhD student here at
UBC who has been doing a lot of work with Remus internally. Up to this
point he''s concentrated on new features that aren''t quite
ready for
upstream yet: checkpoint compression, DRBD integration, and failback
(live resync back to the primary after it has recovered from
failure).

He''s also started doing maintenance work for his own purposes that
would be beneficial for Xen: building a regression test suite, working
on pvops support, and planning xl integration. He''s excited to take on
a maintenance role, and ready to put together a plan for upstreaming
his work.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Campbell

2011-Feb-11 09:14 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Thu, 2011-02-10 at 19:31 +0000, Brendan Cully wrote:> On Thursday, 10 February 2011 at 12:40, Ian Campbell wrote:
> > (cutting CC to just xen-devel)
> > 
> > On Thu, 2011-02-10 at 11:31 +0000, Ian Campbell wrote:
> > > On Wed, 2011-02-09 at 23:16 +0000, Brendan Cully wrote:
> > > > I''d like to keep the fast resume option, and expect
that it can be
> > > > contained entirely in Xen-specific code. I''ll try
to get someone to
> > > > look into it here.
> > 
> > On that note: Is there someone around who is willing to more actively
> > and visibly maintain Remus? i.e. make it work with the new toolstack,
> > upstream it to the mainline kernel, work to get it into a state where
it
> > can be tested as a matter of course, and generally be seen to be
> > improving it (and its integration/interaction with the rest of the
> > ecosystem) with time?
> > 
> > At the moment the impression is that Remus is mostly being left to rot
> > apart from when noise gets made about specific issues.
> 
> I''d like to introduce Shriram Rajagopalan, another PhD student
here at
> UBC who has been doing a lot of work with Remus internally. Up to this
> point he''s concentrated on new features that aren''t quite
ready for
> upstream yet: checkpoint compression, DRBD integration, and failback
> (live resync back to the primary after it has recovered from
> failure).
> 
> He''s also started doing maintenance work for his own purposes that
> would be beneficial for Xen: building a regression test suite, working
> on pvops support, and planning xl integration. He''s excited to
take on
> a maintenance role, and ready to put together a plan for upstreaming
> his work.
Awesome! Welcome Shriram!

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Pasi Kärkkäinen

2011-Feb-11 09:37 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Fri, Feb 11, 2011 at 09:14:09AM +0000, Ian Campbell
wrote:> On Thu, 2011-02-10 at 19:31 +0000, Brendan Cully wrote:
> > On Thursday, 10 February 2011 at 12:40, Ian Campbell wrote:
> > > (cutting CC to just xen-devel)
> > > 
> > > On Thu, 2011-02-10 at 11:31 +0000, Ian Campbell wrote:
> > > > On Wed, 2011-02-09 at 23:16 +0000, Brendan Cully wrote:
> > > > > I''d like to keep the fast resume option, and
expect that it can be
> > > > > contained entirely in Xen-specific code. I''ll
try to get someone to
> > > > > look into it here.
> > > 
> > > On that note: Is there someone around who is willing to more
actively
> > > and visibly maintain Remus? i.e. make it work with the new
toolstack,
> > > upstream it to the mainline kernel, work to get it into a state
where it
> > > can be tested as a matter of course, and generally be seen to be
> > > improving it (and its integration/interaction with the rest of
the
> > > ecosystem) with time?
> > > 
> > > At the moment the impression is that Remus is mostly being left
to rot
> > > apart from when noise gets made about specific issues.
> > 
> > I''d like to introduce Shriram Rajagopalan, another PhD
student here at
> > UBC who has been doing a lot of work with Remus internally. Up to this
> > point he''s concentrated on new features that aren''t
quite ready for
> > upstream yet: checkpoint compression, DRBD integration, and failback
> > (live resync back to the primary after it has recovered from
> > failure).
> > 
> > He''s also started doing maintenance work for his own purposes
that
> > would be beneficial for Xen: building a regression test suite, working
> > on pvops support, and planning xl integration. He''s excited
to take on
> > a maintenance role, and ready to put together a plan for upstreaming
> > his work.
> 
> Awesome! Welcome Shriram!
> 
Indeed, very much welcome! Remus is discussed almost every day
on ##xen on irc, many people seem to want to use it!

-- Pasi


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Campbell

2011-Feb-11 09:51 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Thu, 2011-02-10 at 19:31 +0000, Brendan Cully wrote:> On Thursday, 10 February 2011 at 12:40, Ian Campbell wrote:
> > (cutting CC to just xen-devel)
> > 
> > On Thu, 2011-02-10 at 11:31 +0000, Ian Campbell wrote:
> > > On Wed, 2011-02-09 at 23:16 +0000, Brendan Cully wrote:
> > > > I''d like to keep the fast resume option, and expect
that it can be
> > > > contained entirely in Xen-specific code. I''ll try
to get someone to
> > > > look into it here.
> > 
> > On that note: Is there someone around who is willing to more actively
> > and visibly maintain Remus? i.e. make it work with the new toolstack,
> > upstream it to the mainline kernel, work to get it into a state where
it
> > can be tested as a matter of course, and generally be seen to be
> > improving it (and its integration/interaction with the rest of the
> > ecosystem) with time?
> > 
> > At the moment the impression is that Remus is mostly being left to rot
> > apart from when noise gets made about specific issues.
> 
> I''d like to introduce Shriram Rajagopalan,[...] He''s
excited to take on
> a maintenance role, and ready to put together a plan for upstreaming
> his work.
Shriram, are you happy to add your acked-by to the following?

I mostly guessed the most obvious path for the F: tag, feel free to
extend (I presume there are bits under tools/python too?). If you have a
Remus tree somewhere you can add a T: label too if you like. We can get
you a tree on xenbis.xen.org if you want.

Ian.

8<-------------------

MAINTAINERS: Add Remus maintainer.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

diff -r 9e463cb15658 MAINTAINERS
--- a/MAINTAINERS	Mon Feb 07 17:02:46 2011 +0000
+++ b/MAINTAINERS	Fri Feb 11 09:46:38 2011 +0000
@@ -158,6 +158,11 @@ S:	Supported
 S:	Supported
 T:	git git://xenbits.xen.org/qemu-xen-*.git
 
+REMUS
+M:	Shriram Rajagopalan <rshriram@cs.ubc.ca>
+S:	Maintained
+F:	tools/remus/
+
 SCHEDULING
 M:	George Dunlap <george.dunlap@eu.citrix.com>
 S:	Supported



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shriram Rajagopalan

2011-Feb-11 18:13 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Fri, Feb 11, 2011 at 1:51 AM, Ian Campbell
<Ian.Campbell@eu.citrix.com> wrote:> On Thu, 2011-02-10 at 19:31 +0000, Brendan Cully wrote:
>> On Thursday, 10 February 2011 at 12:40, Ian Campbell wrote:
>> > (cutting CC to just xen-devel)
>> >
>> > On Thu, 2011-02-10 at 11:31 +0000, Ian Campbell wrote:
>> > > On Wed, 2011-02-09 at 23:16 +0000, Brendan Cully wrote:
>> > > > I''d like to keep the fast resume option, and
expect that it can be
>> > > > contained entirely in Xen-specific code. I''ll
try to get someone to
>> > > > look into it here.
>> >
>> > On that note: Is there someone around who is willing to more
actively
>> > and visibly maintain Remus? i.e. make it work with the new
toolstack,
>> > upstream it to the mainline kernel, work to get it into a state
where it
>> > can be tested as a matter of course, and generally be seen to be
>> > improving it (and its integration/interaction with the rest of the
>> > ecosystem) with time?
>> >
>> > At the moment the impression is that Remus is mostly being left to
rot
>> > apart from when noise gets made about specific issues.
>>
>> I''d like to introduce Shriram Rajagopalan,[...] He''s
excited to take on
>> a maintenance role, and ready to put together a plan for upstreaming
>> his work.
>
> Shriram, are you happy to add your acked-by to the following?
>
> I mostly guessed the most obvious path for the F: tag, feel free to
> extend (I presume there are bits under tools/python too?). If you have a
> Remus tree somewhere you can add a T: label too if you like. We can get
> you a tree on xenbis.xen.org if you want.
>
> Ian.
>
> 8<-------------------
>
> MAINTAINERS: Add Remus maintainer.
>
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
>
> diff -r 9e463cb15658 MAINTAINERS
> --- a/MAINTAINERS       Mon Feb 07 17:02:46 2011 +0000
> +++ b/MAINTAINERS       Fri Feb 11 09:46:38 2011 +0000
> @@ -158,6 +158,11 @@ S: Supported
>  S:     Supported
>  T:     git git://xenbits.xen.org/qemu-xen-*.git
>
> +REMUS
> +M:     Shriram Rajagopalan <rshriram@cs.ubc.ca>
> +S:     Maintained
> +F:     tools/remus/
> +
>  SCHEDULING
>  M:     George Dunlap <george.dunlap@eu.citrix.com>
>  S:     Supported
>
>
>
Yep. You can add the acked-by.
About the F: tag, the following paths contain remus code
  tools/remus/
  tools/python/xen/remus/
  tools/python/xen/lowlevel/checkpoint/
  tools/blktap2/drivers/

The blktap2/drivers path has only few files (block-remus.c, hashtable* files)

About the tree, it would be great if I could get a tree. :)

shriram

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Campbell

2011-Feb-14 09:15 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Fri, 2011-02-11 at 18:13 +0000, Shriram Rajagopalan
wrote:> On Fri, Feb 11, 2011 at 1:51 AM, Ian Campbell
> <Ian.Campbell@eu.citrix.com> wrote:
> > On Thu, 2011-02-10 at 19:31 +0000, Brendan Cully wrote:
> >> On Thursday, 10 February 2011 at 12:40, Ian Campbell wrote:
> >> > (cutting CC to just xen-devel)
> >> >
> >> > On Thu, 2011-02-10 at 11:31 +0000, Ian Campbell wrote:
> >> > > On Wed, 2011-02-09 at 23:16 +0000, Brendan Cully wrote:
> >> > > > I''d like to keep the fast resume option,
and expect that it can be
> >> > > > contained entirely in Xen-specific code.
I''ll try to get someone to
> >> > > > look into it here.
> >> >
> >> > On that note: Is there someone around who is willing to more
actively
> >> > and visibly maintain Remus? i.e. make it work with the new
toolstack,
> >> > upstream it to the mainline kernel, work to get it into a
state where it
> >> > can be tested as a matter of course, and generally be seen to
be
> >> > improving it (and its integration/interaction with the rest
of the
> >> > ecosystem) with time?
> >> >
> >> > At the moment the impression is that Remus is mostly being
left to rot
> >> > apart from when noise gets made about specific issues.
> >>
> >> I''d like to introduce Shriram Rajagopalan,[...]
He''s excited to take on
> >> a maintenance role, and ready to put together a plan for
upstreaming
> >> his work.
> >
> > Shriram, are you happy to add your acked-by to the following?
> >[...]
> Yep. You can add the acked-by.
> About the F: tag, the following paths contain remus code
>   tools/remus/
>   tools/python/xen/remus/
>   tools/python/xen/lowlevel/checkpoint/
>   tools/blktap2/drivers/
> 
> The blktap2/drivers path has only few files (block-remus.c, hashtable*
files)
Thanks, I''ll resend the patch with this update and your ack.
> About the tree, it would be great if I could get a tree. :)
I''ve dropped you and the xenbits admins an email offline about this.

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Campbell

2011-Feb-14 09:27 UTC

head link

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

On Mon, 2011-02-14 at 09:15 +0000, Ian Campbell wrote:> On Fri, 2011-02-11 at 18:13 +0000, Shriram Rajagopalan wrote:
> > About the F: tag, the following paths contain remus code
> >   tools/remus/
> >   tools/python/xen/remus/
> >   tools/python/xen/lowlevel/checkpoint/
> >   tools/blktap2/drivers/
> > 
> > The blktap2/drivers path has only few files (block-remus.c, hashtable*
files)
> 
> Thanks, I''ll resend the patch with this update and your ack.
Keir beat me to the punch with the original patch. I''ve sent an update
with the extra paths.

Ian.



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Shriram Rajagopalan

2011-Feb-17 07:56 UTC

head link

[Xen-devel] [PATCH] update comments in pm.h describing Xen Guest save/restore/checkpoint use case

Add documentation to pm.h on how xen uses PM events (freeze,
thaw, restore) to implement Guest VM save/checkpoint/restore
functionality.

Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
 include/linux/pm.h |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/include/linux/pm.h b/include/linux/pm.h
index dd9c7ab..2ddd9d3 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -516,6 +516,25 @@ extern void update_pm_runtime_accounting(struct device
*dev);
  * well as during system sleep states like PM_SUSPEND_STANDBY.  They may
  * be able to use wakeup events to exit from runtime low-power states,
  * or from system low-power states such as standby or suspend-to-RAM.
+ *
+ * Xen Guest Kernels use PM_FREEZE, PM_RESTORE and PM_THAW to implement
+ * VM save/restore/checkpoint functionality. Save and Restore are somewhat
+ * similar to hibernate functionality. The sequence of events is shown below:
+ *        dpm_suspend_start(PMSG_FREEZE);
+ *
+ *              dpm_suspend_noirq(PMSG_FREEZE);
+ *
+ *                     sysdev_suspend(PMSG_FREEZE);
+ *                     cancelled = suspend_hypercall()
+ *                     sysdev_resume();
+ *
+ *             dpm_resume_noirq(cancelled ? PMSG_THAW : PMSG_RESTORE);
+ *
+ *     dpm_resume_end(cancelled ? PMSG_THAW : PMSG_RESTORE);
+ *
+ * If the syspend_hypercall returns 1, it means that the VM was merely
+ * checkpointed (akin to THAW). If it returns 0, it means the system has been
+ * fully restored from its on-disk snapshot (akin to RESTORE).
  */
 
 #ifdef CONFIG_PM_SLEEP
-- 
1.7.0.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Campbell

2011-Feb-17 10:56 UTC

head link

Re: [Xen-devel] [PATCH] update comments in pm.h describing Xen Guest save/restore/checkpoint use case

On Thu, 2011-02-17 at 07:56 +0000, Shriram Rajagopalan
wrote:> Add documentation to pm.h on how xen uses PM events (freeze,
> thaw, restore) to implement Guest VM save/checkpoint/restore
> functionality.
The change to freeze/thaw/restore instead of suspend/resume raises the
question of what the correct .config option for Xen to key off is.

Currently the Xen suspend functionality in drivers/xen/manage.c is keyed
off CONFIG_PM_SLEEP (which depends on SUSPEND || HIBERNATION ||
XEN_SAVE_RESTORE).

Since PMSG_{SUSPEND,RESUME} are covered by CONFIG_SUSPEND in pm_op()
this already seems incorrect before this change, since PM_SLEEP can be
enabled without SUSPEND, and is equally incorrect after changing to
PMSG_{FREEZE,THAW,RESUME}, which are covered by CONFIG_HIBERNATION.

Is the correct fix to update pm_op to include
	|| defined(CONFIG_XEN_SAVE_RESTORE)
where appropriate in pm_op()? 

Or should we be looking to adjust the Kconfig and/or code on the Xen
side? Having Xen depend on HIBERNATION ="Hibernation (aka ''suspend
to
disk'')" seems semantically incorrect.

Ian.
> 
> Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> ---
>  include/linux/pm.h |   19 +++++++++++++++++++
>  1 files changed, 19 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/pm.h b/include/linux/pm.h
> index dd9c7ab..2ddd9d3 100644
> --- a/include/linux/pm.h
> +++ b/include/linux/pm.h
> @@ -516,6 +516,25 @@ extern void update_pm_runtime_accounting(struct device
*dev);
>   * well as during system sleep states like PM_SUSPEND_STANDBY.  They may
>   * be able to use wakeup events to exit from runtime low-power states,
>   * or from system low-power states such as standby or suspend-to-RAM.
> + *
> + * Xen Guest Kernels use PM_FREEZE, PM_RESTORE and PM_THAW to implement
> + * VM save/restore/checkpoint functionality. Save and Restore are somewhat
> + * similar to hibernate functionality. The sequence of events is shown
below:
> + *        dpm_suspend_start(PMSG_FREEZE);
> + *
> + *              dpm_suspend_noirq(PMSG_FREEZE);
> + *
> + *                     sysdev_suspend(PMSG_FREEZE);
> + *                     cancelled = suspend_hypercall()
> + *                     sysdev_resume();
> + *
> + *             dpm_resume_noirq(cancelled ? PMSG_THAW : PMSG_RESTORE);
> + *
> + *     dpm_resume_end(cancelled ? PMSG_THAW : PMSG_RESTORE);
> + *
> + * If the syspend_hypercall returns 1, it means that the VM was merely
> + * checkpointed (akin to THAW). If it returns 0, it means the system has
been
> + * fully restored from its on-disk snapshot (akin to RESTORE).
>   */
>  
>  #ifdef CONFIG_PM_SLEEP


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Apparently Analagous Threads

Search for more possibly parallel threads

Xen devel - Feb 2011 - [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

[Xen-devel] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

[Xen-devel] [PATCH 1/2] Fix hangup after creating checkpoint on Xen.

[Xen-devel] [PATCH 2/2] Fix hangup after creating checkpoint on Xen.

[Xen-devel] Re: [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

[Xen-devel] Re: [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

[Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

[Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

[Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

[Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen.

[Xen-devel] [PATCH] update comments in pm.h describing Xen Guest save/restore/checkpoint use case

Re: [Xen-devel] [PATCH] update comments in pm.h describing Xen Guest save/restore/checkpoint use case

Apparently Analagous Threads