Hi.

When a domain is resumed after aborted save (e.g., because of insufficient
space on a device where the domain is being saved), xm console cannot
send/read any data. The reason is that the event channel used by xenconsole
stays unbound.

This patch modifies xenconsoled to check current status of open event channels
and rebind them if necessary.

Signed-off-by: Jiri Denemark <jdenemar@redhat.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Gerd Hoffmann
2009-Apr-20  13:38 UTC
Re: [Xen-devel] [PATCH] Fix xenconsole after aborted save
On 04/20/09 15:23, Jiri Denemark wrote:
> Hi.
>
> When a domain is resumed after aborted save (e.g., because of insufficient
> space on a device where the domain is being saved), xm console cannot
> send/read any data. The reason is that the event channel used by xenconsole
> stays unbound.
>
> This patch modifies xenconsoled to check current status of open event channels
> and rebind them if necessary.

close() + open() is the sledge hammer approach (will work though). Just
unbind(local_port) should be enough.

cheers,
  Gerd
Jiri Denemark
2009-Apr-20  14:04 UTC
Re: [Xen-devel] [PATCH] Fix xenconsole after aborted save
On Mon, Apr 20, 2009 at 15:38:41 +0200, Gerd Hoffmann wrote:
> On 04/20/09 15:23, Jiri Denemark wrote:
>> Hi.
>>
>> When a domain is resumed after aborted save (e.g., because of insufficient
>> space on a device where the domain is being saved), xm console cannot
>> send/read any data. The reason is that the event channel used by xenconsole
>> stays unbound.
>>
>> This patch modifies xenconsoled to check current status of open event channels
>> and rebind them if necessary.
>
> close() + open() is the sledge hammer approach (will work though). Just
> unbind(local_port) should be enough.

It doesn't close() and open(), it just calls xc_evtchn_bind_interdomain() in
case the event channel is unbound. The close() + open() combination was there
before... I haven't touched that code except for skipping it when only rebind
is required.
Gerd Hoffmann
2009-Apr-20  14:09 UTC
Re: [Xen-devel] [PATCH] Fix xenconsole after aborted save
On 04/20/09 16:04, Jiri Denemark wrote:
> On Mon, Apr 20, 2009 at 15:38:41 +0200, Gerd Hoffmann wrote:
>> On 04/20/09 15:23, Jiri Denemark wrote:
>>> This patch modifies xenconsoled to check current status of open event channels
>>> and rebind them if necessary.
>> close() + open() is the sledge hammer approach (will work though). Just
>> unbind(local_port) should be enough.
>
> It doesn't close() and open(), it just calls xc_evtchn_bind_interdomain() in
> case the event channel is unbound. The close() + open() combination was there
> before... I haven't touched that code except for skipping it when only rebind
> is required.

Oh, ok, got the logic wrong. The patch looks fine to me then.

cheers,
  Gerd
Keir Fraser
2009-Apr-20  14:41 UTC
Re: [Xen-devel] [PATCH] Fix xenconsole after aborted save
On 20/04/2009 15:04, "Jiri Denemark" <jdenemar@redhat.com> wrote:

>>> This patch modifies xenconsoled to check current status of open event
>>> channels
>>> and rebind them if necessary.
>>
>> close() + open() is the sledge hammer approach (will work though). Just
>> unbind(local_port) should be enough.
>
> It doesn't close() and open(), it just calls xc_evtchn_bind_interdomain() in
> case the event channel is unbound. The close() + open() combination was there
> before... I haven't touched that code except for skipping it when only rebind
> is required.

And actually that is a bug, since you will leak the old dom->local_port. I
checked in an alternative patch as c/s 19561, so please take a look and test
that resolves your issue.

Another thing to note is I think this problem can only occur if the domU does
not support suspend cancellation (advertised as SUSPEND_CANCEL in kernel elf
notes -- see xen/xend/XendDomainInfo.py:resumeDomain()). Your kernels should
support that feature -- suspend cancellation (a.k.a. Resume) is very likely to
be hit-or-miss without it!

 -- Keir
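The leak Keir describes can be illustrated with a toy model. This is only a sketch in Python, not the xenconsoled C code and not the contents of c/s 19561; `FakeEvtchn`, `Domain`, `rebind_leaky` and `rebind_clean` are hypothetical names invented for illustration. It assumes that binding always hands out a fresh local port, so overwriting the stored port without unbinding it first leaves the old one allocated.

```python
class FakeEvtchn:
    """Hypothetical stand-in for an event-channel handle (not the libxc API)."""

    def __init__(self):
        self.bound = {}   # local_port -> (domid, remote_port)
        self._next = 1

    def bind_interdomain(self, domid, remote_port):
        # Every bind allocates a new local port.
        port = self._next
        self._next += 1
        self.bound[port] = (domid, remote_port)
        return port

    def unbind(self, port):
        del self.bound[port]


class Domain:
    """Hypothetical per-domain record holding the console event channel ports."""

    def __init__(self, domid):
        self.domid = domid
        self.local_port = -1
        self.remote_port = -1


def rebind_leaky(xce, dom, remote_port):
    # Overwrites dom.local_port without releasing the previous port.
    dom.local_port = xce.bind_interdomain(dom.domid, remote_port)
    dom.remote_port = remote_port


def rebind_clean(xce, dom, remote_port):
    # Release the old local port (if any) before binding again.
    if dom.local_port != -1:
        xce.unbind(dom.local_port)
        dom.local_port = -1
    dom.local_port = xce.bind_interdomain(dom.domid, remote_port)
    dom.remote_port = remote_port
```

In this model, calling `rebind_leaky` twice leaves two ports allocated while only the second is still tracked by the domain record; `rebind_clean` keeps exactly one.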
Jiri Denemark
2009-Apr-20  15:58 UTC
Re: [Xen-devel] [PATCH] Fix xenconsole after aborted save
> > It doesn't close() and open(), it just calls xc_evtchn_bind_interdomain() in
> > case the event channel is unbound. The close() + open() combination was there
> > before... I haven't touched that code except for skipping it when only rebind
> > is required.
>
> And actually that is a bug, since you will leak the old dom->local_port.

Ah, ok, thanks for fixing that...

> I checked in an alternative patch as c/s 19561, so please take a look and
> test that resolves your issue.

Yes, the alternative patch works fine.

> Another thing to note is I think this problem can only occur if the domU
> does not support suspend cancellation

Sure, the problem only occurs for kernels without SUSPEND_CANCEL feature.

Jirka
Chris Lalancette
2009-Apr-22  09:49 UTC
Re: [Xen-devel] [PATCH] Fix xenconsole after aborted save
Keir Fraser wrote:
> Another thing to note is I think this problem can only occur if the domU
> does not support suspend cancellation (advertised as SUSPEND_CANCEL in
> kernel elf notes -- see xen/xend/XendDomainInfo.py:resumeDomain()). Your
> kernels should support that feature -- suspend cancellation (a.k.a. Resume)
> is very likely to be hit-or-miss without it!

Could you elaborate a bit on this? I was under the impression that suspend
cancellation was there mostly for the netaccel bits, but I have to admit I
didn't look at it very closely. What scenarios do the suspend cancel bits
help?

--
Chris Lalancette
Keir Fraser
2009-Apr-22  10:22 UTC
Re: [Xen-devel] [PATCH] Fix xenconsole after aborted save
On 22/04/2009 10:49, "Chris Lalancette" <clalance@redhat.com> wrote:

> Keir Fraser wrote:
>> Another thing to note is I think this problem can only occur if the domU
>> does not support suspend cancellation (advertised as SUSPEND_CANCEL in
>> kernel elf notes -- see xen/xend/XendDomainInfo.py:resumeDomain()). Your
>> kernels should support that feature -- suspend cancellation (a.k.a. Resume)
>> is very likely to be hit-or-miss without it!
>
> Could you elaborate a bit on this? I was under the impression that suspend
> cancellation was there mostly for the netaccel bits, but I have to admit I
> didn't look at it very closely. What scenarios do the suspend cancel bits
> help?

Suspend failures (failure to save or to migrate). Also for live
checkpointing/snapshotting. The feature indicates that the guest is happy for
the suspend hypercall to return indicating 'failure/cancelled' and in which
case it can pretty much resume whatever it was doing without any of the usual
resume logic. The alternative is for the toolstack to make it look like the
domain has been restored/migrated, by resetting PV devices and the like, and
this is not much tested and almost inherently fragile.

The suspend_cancel hook you are thinking of is for any PV devices which *do*
need to know that a suspend was cancelled. Netaccel does for some reason which
I cannot recall.

Anyway, it is pretty easy and pretty important to support SUSPEND_CANCEL!

 -- Keir
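The two recovery paths described above can be summarised as a small decision function. This is a hedged sketch only: `plan_resume` and the representation of guest features as a set of strings are assumptions made for illustration, and none of it is taken from xend's resumeDomain().

```python
def plan_resume(guest_features):
    """Return the steps a toolstack might take after a failed save/migrate.

    guest_features is a hypothetical set of feature names advertised by
    the guest kernel; the step descriptions paraphrase the message above.
    """
    if "SUSPEND_CANCEL" in guest_features:
        # Guest accepts a 'cancelled' return from the suspend hypercall.
        return [
            "let the suspend hypercall return 'failure/cancelled'",
            "guest continues without the usual resume logic",
        ]
    # Otherwise the toolstack has to fake a restore/migration.
    return [
        "reset PV devices",
        "present the domain as if it had been restored/migrated",
        "guest runs its usual resume logic",
    ]
```

For example, `plan_resume({"SUSPEND_CANCEL"})` yields the short path, while `plan_resume(set())` yields the path that starts with resetting PV devices, which is the less tested one according to the message above.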
John Levon
2009-Apr-22  12:41 UTC
Re: [Xen-devel] [PATCH] Fix xenconsole after aborted save
On Wed, Apr 22, 2009 at 11:22:01AM +0100, Keir Fraser wrote:
> Anyway, it is pretty easy and pretty important to support SUSPEND_CANCEL!

Note that it's still the case that Solaris doesn't, unfortunately.

regards
john
Chris Lalancette
2009-Apr-22  14:24 UTC
Re: [Xen-devel] [PATCH] Fix xenconsole after aborted save
Keir Fraser wrote:
> On 22/04/2009 10:49, "Chris Lalancette" <clalance@redhat.com> wrote:
>
>> Keir Fraser wrote:
>>> Another thing to note is I think this problem can only occur if the domU
>>> does not support suspend cancellation (advertised as SUSPEND_CANCEL in
>>> kernel elf notes -- see xen/xend/XendDomainInfo.py:resumeDomain()). Your
>>> kernels should support that feature -- suspend cancellation (a.k.a. Resume)
>>> is very likely to be hit-or-miss without it!
>> Could you elaborate a bit on this? I was under the impression that suspend
>> cancellation was there mostly for the netaccel bits, but I have to admit I
>> didn't look at it very closely. What scenarios do the suspend cancel bits
>> help?
>
> Suspend failures (failure to save or to migrate). Also for live
> checkpointing/snapshotting. The feature indicates that the guest is happy
> for the suspend hypercall to return indicating 'failure/cancelled' and in
> which case it can pretty much resume whatever it was doing without any of
> the usual resume logic. The alternative is for the toolstack to make it look
> like the domain has been restored/migrated, by resetting PV devices and the
> like, and this is not much tested and almost inherently fragile.
>
> The suspend_cancel hook you are thinking of is for any PV devices which *do*
> need to know that a suspend was cancelled. Netaccel does for some reason
> which I cannot recall.
>
> Anyway, it is pretty easy and pretty important to support SUSPEND_CANCEL!

OK, that makes sense. Thanks Keir!

--
Chris Lalancette
Neil Turton
2009-Apr-22  16:15 UTC
Re: [Xen-devel] [PATCH] Fix xenconsole after aborted save
Keir Fraser wrote:
> The suspend_cancel hook you are thinking of is for any PV devices which *do*
> need to know that a suspend was cancelled. Netaccel does for some reason
> which I cannot recall.

For the record, netaccel needs to remove any mappings of the hardware during
the suspend callback since those mappings are not valid after the suspend has
completed. Those hardware mappings (or the mappings appropriate for the new
machine) are restored on resume as a result of the standard netfront/netback
connection being re-established. If the suspend is cancelled, the
netfront/netback connection stays in place but the hardware mappings need to
be re-established which is why the suspend_cancel callback is used.

Cheers, Neil.
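The lifecycle Neil describes can be modelled as a small state machine. This is a toy sketch, not netaccel code; `AccelModel` and its method and attribute names are hypothetical and only mirror the three callbacks mentioned above.

```python
class AccelModel:
    """Toy model of the mapping lifecycle; all names are hypothetical."""

    def __init__(self):
        self.connected = True   # netfront/netback connection is up
        self.hw_mapped = True   # accelerated hardware mappings are in place

    def suspend(self):
        # Mappings are not valid after the suspend completes, so drop them.
        self.hw_mapped = False

    def resume(self):
        # Normal resume: the netfront/netback connection is re-established,
        # and the mappings are restored as a side effect of that.
        self.connected = False
        self._reconnect()

    def suspend_cancel(self):
        # Cancelled suspend: the connection stays in place, so nothing
        # else would restore the mappings; do it explicitly here.
        self.hw_mapped = True

    def _reconnect(self):
        self.connected = True
        self.hw_mapped = True
```

In the model, a suspend followed by no callback at all leaves `connected` true but `hw_mapped` false, which is the state the suspend_cancel callback exists to repair.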