I am using xen-testing and trying to migrate a VM between two machines.
However, the migrate command fails with the error "Error: errors:
suspend failed, Callback timed out".

After the error on the source machine, while the VM shows up on xm list
at the destination, xfrd.log (on the destination) shows that it's
trying to reload memory pages beyond 100%. The number of memory pages
reloaded keeps going up until I use 'xm destroy'.

The xm save / scp <saved image> / xm restore path works fine for the
machine, though. Any ideas why live migration would not work? The
combined memory usage of dom0 and domU is around half the physical
memory present in the machine.

The traceback from xend.log:

    Traceback (most recent call last):
      File "/usr/lib/python2.3/site-packages/twisted/internet/defer.py", line 308, in _startRunCallbacks
        self.timeoutCall.cancel()
      File "/usr/lib/python2.3/site-packages/twisted/internet/base.py", line 82, in cancel
        raise error.AlreadyCalled
    AlreadyCalled: Tried to cancel an already-called event.

xfrd.log on the sender is full of "Retry suspend domain (120)" before
it says "Unable to suspend domain. (120)" and "Domain appears not to
have suspended: 120".

Niraj

--
http://www.cs.cmu.edu/~ntolia
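For reference, the non-live workaround described above amounts to
something like the following sketch; the domain name, image path, and
destination host are placeholders, and the exact xm syntax may differ
between Xen releases of this era:

    # On the source host: suspend the domain to a state file (non-live).
    xm save vm1 /var/tmp/vm1.chkpt

    # Copy the saved image to the destination host (path illustrative).
    scp /var/tmp/vm1.chkpt destination:/var/tmp/vm1.chkpt

    # On the destination host: resume the domain from the saved image.
    xm restore /var/tmp/vm1.chkpt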
> I am using xen-testing and trying to migrate a VM between two
> machines. However, the migrate command fails with the error "Error:
> errors: suspend failed, Callback timed out".

It might be worth trying Charles Coffing's patch to the iostream
handling. I'm planning on applying it when I get a chance.

> After the error on the source machine, while the VM shows up
> on xm list at the destination, xfrd.log (on the destination)
> shows that it's trying to reload memory pages beyond 100%. The
> number of memory pages reloaded keeps going up until I use
> 'xm destroy'.

Going beyond 100% is normal behaviour: live migration copies memory
iteratively, re-sending pages that are dirtied while the previous pass
is in flight, so the total transferred can exceed the domain's memory
size. But obviously it should terminate eventually, after doing a
number of iterations.

Posting the xfrd log (after applying the patch) would be interesting.

Best,
Ian
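To capture the log Ian asks for, one can follow xfrd's output on both
hosts while retrying the migration. The log path and migrate invocation
below are assumptions (check where your build writes xfrd.log, and your
release's exact xm migrate syntax):

    # On the source and destination hosts, follow the transfer
    # daemon's log while the migration runs (log path assumed).
    tail -f /var/log/xfrd.log

    # In another shell on the source host, start the live migration;
    # the domain name and destination host are placeholders.
    xm migrate --live vm1 destination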
Niraj Tolia
2005-Apr-12 20:49 UTC
Re: [Xen-users] Problems with live migrate on xen-testing
On Apr 8, 2005 6:20 AM, Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> wrote:
> > I am using xen-testing and trying to migrate a VM between two
> > machines. However, the migrate command fails with the error
> > "Error: errors: suspend failed, Callback timed out".
>
> It might be worth trying Charles Coffing's patch to the iostream
> handling. I'm planning on applying it when I get a chance.

Hi Ian,

The patch did not make a difference. I do have a few more data points
though. Irrespective of whether the patch is applied, migration without
the --live switch works. Further, even --live seems to work if all the
memory pages are copied in one iteration. However, if xfrd.log shows
that a second iteration has been started, live migration will fail.

> > After the error on the source machine, while the VM shows up
> > on xm list at the destination, xfrd.log (on the destination)
> > shows that it's trying to reload memory pages beyond 100%. The
> > number of memory pages reloaded keeps going up until I use
> > 'xm destroy'.
>
> Going beyond 100% is normal behaviour, but obviously it should
> terminate eventually, after doing a number of iterations.
>
> Posting the xfrd log (after applying the patch) would be interesting.

Applying the patch did not make a difference. However, reducing the
amount of memory provided to the migrated domain changes the error
message. While the live migration still fails, it no longer loops
reloading pages; it now fails with a message like "Frame number in
type 1 page table is out of range".

The xfrd logs from the sender and receiver are attached for 512 MB and
256 MB domain configurations.

Also, a few other smaller things.

a) On doing a migrate (without --live), the 'xm migrate' command does
not return control to the shell even after a successful migration. A
Control-C gives the following trace:

    Traceback (most recent call last):
      File "/usr/sbin/xm", line 9, in ?
        main.main(sys.argv)
      File "/usr/lib/python/xen/xm/main.py", line 808, in main
        xm.main(args)
      File "/usr/lib/python/xen/xm/main.py", line 106, in main
        self.main_call(args)
      File "/usr/lib/python/xen/xm/main.py", line 124, in main_call
        p.main(args[1:])
      File "/usr/lib/python/xen/xm/main.py", line 309, in main
        migrate.main(args)
      File "/usr/lib/python/xen/xm/migrate.py", line 49, in main
        server.xend_domain_migrate(dom, dst, opts.vals.live, opts.vals.resource)
      File "/usr/lib/python/xen/xend/XendClient.py", line 249, in xend_domain_migrate
        {'op' : 'migrate',
      File "/usr/lib/python/xen/xend/XendClient.py", line 148, in xendPost
        return self.client.xendPost(url, data)
      File "/usr/lib/python/xen/xend/XendProtocol.py", line 79, in xendPost
        return self.xendRequest(url, "POST", args)
      File "/usr/lib/python/xen/xend/XendProtocol.py", line 143, in xendRequest
        resp = conn.getresponse()
      File "/usr/lib/python2.3/httplib.py", line 778, in getresponse
        response.begin()
      File "/usr/lib/python2.3/httplib.py", line 273, in begin
        version, status, reason = self._read_status()
      File "/usr/lib/python2.3/httplib.py", line 231, in _read_status
        line = self.fp.readline()
      File "/usr/lib/python2.3/socket.py", line 323, in readline
        data = recv(1)

b) I noticed that if the sender migrates a VM (without --live) and has
a console attached to the domain, CPU utilization hits 100% after
migration until the console is disconnected.

Niraj

--
http://www.cs.cmu.edu/~ntolia
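One way to confirm where the hung 'xm migrate' above is blocked, short
of hitting Control-C, is to attach strace to the process. This is a
generic diagnostic sketch, not a command from the thread:

    # xm is a Python script, so match it by command line. If xm is
    # blocked reading xend's HTTP response, strace should show it
    # sitting in recv() on the socket to xend, matching the traceback.
    strace -p "$(pgrep -f 'xm migrate')"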
OK, I've managed to reproduce this: under 2.0.5 I can make a domain
crash in the same fashion if it is under extreme network receive load
during the migration. This used to work fine, so we've obviously
introduced a bug in the last few months. I'll investigate when I get a
chance.

We seem to get stuck in a page fault loop writing to an skb's shared
info area, passing through the vmalloc fault section of do_page_fault.
It looks like the PTE is read-only, which is very odd. We just need to
figure out how it got that way.

This smells like the first real Xen-internal bug in the stable series
for several months...

Ian

> The patch did not make a difference. I do have a few more data points
> though. Irrespective of whether the patch is applied, migration
> without the --live switch works. Further, even --live seems to work
> if all the memory pages are copied in one iteration. However, if
> xfrd.log shows that a second iteration has been started, live
> migration will fail.
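A crude way to reproduce the network-receive-load condition Ian
describes is to flood the domU with traffic from a third machine while
the live migration runs. The domain name, hosts, and DOMU_IP variable
below are placeholders, and flood ping requires root:

    # From a machine other than the two Xen hosts, flood-ping the domU
    # to generate sustained receive load (requires root; DOMU_IP is a
    # placeholder for the domU's address).
    ping -f -s 1400 "$DOMU_IP" &

    # While the flood is running, start the live migration from the
    # source host's dom0.
    xm migrate --live vm1 destination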