Hi, I've been scanning the list and seen reports of problems with live migration, so I thought I might add a bit more entropy. I try to do a live migration within the same physical host, i.e.

  xm migrate --live 'whatever' localhost

It fails with 'Error: errors: suspend, failed, Callbak timed out'. It seems like the transfer of memory pages works up to the point where the domain needs to be suspended for the final transfer. The funny thing is it used to work before, gloriously, and I haven't made any software/hardware changes. At some point an xm save command failed with a timeout, and from then on live migration fails with this message. Non-live migration works perfectly, also between different physical hosts. save/restore also works flawlessly. For the record, I use nfsroot. I attached xfrd.log. I can post some other stuff, just ask.

Thanks a lot
Andres

xfrd.log:

(xfr.migrate 6 "(domain (id 6) (name AndresNfsDomain) (memory 511) (maxmem 524288) (state -b---) (cpu 1) (cpu_time 0.10838393) (up_time 27.1105668545) (start_time 1115999325.85) (console (status listening) (id 12) (domain 6) (local_port 12) (remote_port 1) (console_port 9606)) (devices (vif (idx 0) (vif 0) (mac 00:80:84:00:00:11) (vifname vif6.0) (evtchn 13 3) (index 0))) (config (vm (name AndresNfsDomain) (memory 512) (image (linux (kernel /boot/vmlinuz-2.6.11-xenU) (ip 192.168.70.45:192.168.70.106:192.168.70.254:255.255.255.0:virtuality:eth0:off) (root /dev/nfs) (args 'nfsroot=192.168.70.106:/mnt/nfs2,rsize=32768,wsize=32768 4'))) (device (vif (mac 00:80:84:00:00:11))))))" localhost 8002 1 0)
[DEBUG] Conn_sxpr< err=0
[DEBUG] Conn_connect> addr=127.0.0.1:8002
[DEBUG] Conn_init> flags=1
[DEBUG] Conn_init> write stream...
[DEBUG] stream_init>mode=w flags=1 compress=0
[DEBUG] stream_init> unbuffer...
[DEBUG] stream_init< err=0
[DEBUG] Conn_init> read stream...
[DEBUG] stream_init>mode=r flags=1 compress=0
[DEBUG] stream_init> unbuffer...
[DEBUG] stream_init< err=0
[DEBUG] Conn_sxpr> (xfr.err 0)[DEBUG] Conn_sxpr< err=0
[1115999352.965314] xc_linux_save start 6
[1115999352.966265] Saving memory pages: iter 1 0%
4344 [INF] XFRD> Xfr service for 127.0.0.1:54931
[DEBUG] Conn_init> flags=1
[DEBUG] Conn_init> write stream...
[DEBUG] stream_init>mode=w flags=1 compress=0
[DEBUG] stream_init> unbuffer...
[DEBUG] stream_init< err=0
[DEBUG] Conn_init> read stream...
[DEBUG] stream_init>mode=r flags=1 compress=0
[DEBUG] stream_init> unbuffer...
[DEBUG] stream_init< err=0
[DEBUG] Conn_sxpr> (xfr.hello 1 0)[DEBUG] Conn_sxpr< err=0
[DEBUG] Conn_sxpr> (xfr.xfr 6)[DEBUG] Conn_sxpr< err=0
[1115999352.971066] xc_linux_restore start
[1115999352.991648] Created domain 7
[1115999353.003196] Reloading memory pages: 0% 5% 10%
FNI 765 : [1000007e,1020] pte=00bec063, mfn=00000bec, pfn=ffffffff [mfn]=deadbeef
15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95%
1: sent 130824, skipped 243, delta 2629ms, dom0 100%, target 71%, sent 1630Mb/s, dirtied 4Mb/s 321 pages
[1115999355.596112] Saving memory pages: iter 2 0%
2: sent 320, skipped 0, delta 11ms, dom0 0%, target 100%, sent 953Mb/s, dirtied 35Mb/s 12 pages
[1115999355.607606] Saving memory pages: iter 3 0%
3: sent 12, skipped 0, 100%
[DEBUG] Conn_sxpr> (xfr.err 22)[DEBUG] Conn_sxpr< err=0
Retry suspend domain (120)
#... This repeats 198 times in total ...#
Retry suspend domain (120)
Unable to suspend domain. (120)
Domain appears not to have suspended: 120
4343 [WRN] XFRD> Transfer errors:
4343 [WRN] XFRD> state=XFR_STATE err=1
4343 [INF] XFRD> Xfr service err=1
Error when reading from state file
4344 [INF] XFRD> Xfr service err=1

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
On May 13, 2005, at 20:07, Andres Lagar Cavilla wrote:

Andres,

> I try to do a live migration in the same physical host, i.e.
> xm migrate --live 'whatever' localhost
> It fails with 'Error: errors: suspend, failed, Callbak timed out'.
> It seems like transfer of memory pages works until the point when the
> domain needs to be suspended to do the final transfer.
> [...]

I had similar timeout errors previously, when I was using somewhat slower servers. I overcame the problem by slightly increasing the timeout value in controller.py. It seemed to provide a remedy.

Teemu
--
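For readers following along: the suspend handshake waits on a callback with a fixed timeout, so a slow domain response surfaces as the "Callbak timed out" error reported above. The sketch below only illustrates that pattern; the class and method names are hypothetical, not the actual controller.py code.

```python
import threading

# Illustrative sketch of a suspend handshake with a fixed timeout.
# Not the actual xend controller.py code; names are hypothetical.
class SuspendWaiter:
    def __init__(self, timeout=10):        # 10 was reportedly the default
        self.timeout = timeout
        self.done = threading.Event()

    def on_suspended(self):
        # Called when the domain acknowledges the suspend request.
        self.done.set()

    def wait(self):
        # Block until the ack arrives, or give up after `timeout` seconds.
        if not self.done.wait(self.timeout):
            raise RuntimeError("Callback timed out")
```

Raising the timeout, as Teemu describes, simply gives the domain more time to acknowledge before the error fires; it does not address why the acknowledgement is slow in the first place.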
Teemu saves the day!!!
I actually set the timeout to 100 for no particular reason (originally it was 10; 20 didn't work either). Thanks Ian for your suggestion as well.

Cheers!!
Andres

At 02:45 PM 5/13/2005, Teemu Koponen wrote:
>I had similar timeout errors previously, when I was using somewhat
>slower servers. I overcame the problem by slightly increasing the
>timeout value in controller.py. It seemed to provide a remedy.
>
>Teemu
Thanks Teemu and Ian for the help. Now I wonder if somebody can enlighten me as to why this problem arises. It seems that live migration times out while waiting for the domain to suspend, i.e. before the last round of page transfers starts (which should be a relatively small transfer, according to the NSDI submission). Any insights?

Thanks a lot
Andres

At 02:45 PM 5/13/2005, Teemu Koponen wrote:
>I had similar timeout errors previously, when I was using somewhat
>slower servers. I overcame the problem by slightly increasing the
>timeout value in controller.py. It seemed to provide a remedy.
>
>Teemu
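As background for the question: the NSDI design transfers memory in iterative pre-copy rounds, resending pages dirtied during each round, and only suspends the domain for a final stop-and-copy once the dirty set is small. A toy model of that loop (illustrative names and parameters, not the xc_linux_save implementation):

```python
# Toy model of iterative pre-copy live migration, after the NSDI design
# discussed in this thread. Illustrative only -- not xc_linux_save.

def live_migrate(pages, redirtied, max_iters=29, threshold=50):
    """Return (rounds, pages_sent, final_dirty_count)."""
    dirty = set(range(pages))              # round 1 sends all of memory
    sent = 0
    for rounds in range(1, max_iters + 1):
        sent += len(dirty)                 # push the current dirty set
        dirty = set(redirtied(rounds))     # pages re-dirtied meanwhile
        if len(dirty) < threshold:         # converged: time to suspend
            break
    # Stop-and-copy phase: the domain is suspended here, which is exactly
    # the step whose callback timed out in the reported failure.
    sent += len(dirty)
    return rounds, sent, len(dirty)
```

So the suspend happens only after the writable working set has converged; if the suspend request itself stalls, the timeout fires before that final (small) transfer ever starts, which matches the logs above.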
> Teemu saves the day!!!
> I actually set the timeout to 100 for no particular reason
> (originally it was 10; 20 didn't work either). Thanks Ian for
> your suggestion as well.

I'd be really surprised if increasing the timeout actually made a difference. Are you sure you're not just using the shadow mode fix that was checked in a couple of hours ago?

Best,
Ian
It's unlikely that I caught that fix on my last code pull, more than 3 hours ago. However, live migration didn't work before tweaking the timeout and did work afterwards, so the timeout solution deserves some credence. I'll make sure I get that fix, recompile, re-run and report.

Thanks
Andres

At 05:14 PM 5/13/2005, Ian Pratt wrote:
>I'd be really surprised if increasing the timeout actually made a
>difference. Are you sure you're not just using the shadow mode fix
>that was checked in a couple of hours ago?
>
>Best,
>Ian
Hi Ian,
I got a fresh code image this morning. Live migration works fine, even after un-tweaking the timer back to its default value. I have tested, not necessarily thoroughly, but I haven't run into trouble yet. I guess this closes this chapter.

For whatever it may be worth, I have some comments regarding the "previous" (Friday, May 13) xfrd version:

- Even though the timeout increase would allow live migration to complete successfully, this was not always the case; there was actually about a 50% chance of success.

- On all successful migrations, the number of skipped pages after the last iteration and before domain suspend was always zero:

  Saving memory pages: iter 3 0% 3: sent 0, skipped 0,
  [DEBUG] Conn_sxpr> (AndresNfsDomain 8)[DEBUG] Conn_sxpr< err=0
  [1116255361.997192] SUSPEND flags 00020004 shinfo 00000beb eip c01068fe esi 0002de60

- On all failed migrations, there was a nonzero number of said skipped pages (sometimes 12, sometimes 4).

Hope this somehow helps. Keep up the excellent work.
Andres

Ian Pratt wrote:
>I'd be really surprised if increasing the timeout actually made a
>difference. Are you sure you're not just using the shadow mode fix
>that was checked in a couple of hours ago?
>
>Best,
>Ian
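The correlation Andres reports can be checked mechanically against a saved xfrd log. A small hedged helper (the regex matches the "N: sent S, skipped K" progress lines quoted in this thread; adjust it if your xfrd build formats them differently):

```python
import re

# Match xfrd per-iteration progress lines, e.g. "3: sent 12, skipped 0,"
PROGRESS = re.compile(r"(\d+): sent (\d+), skipped (\d+)")

def final_iteration_skipped(log_text):
    """Skipped-page count of the last save iteration, or None if absent."""
    matches = PROGRESS.findall(log_text)
    if not matches:
        return None
    last_iter = max(int(i) for i, _, _ in matches)
    # xfrd logs often duplicate lines (console + file), so any match
    # for the last iteration will do.
    for i, _, skipped in reversed(matches):
        if int(i) == last_iter:
            return int(skipped)
```

Per the observation above, a nonzero result for the final iteration coincided with failed migrations under the pre-fix xfrd.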
Ian,
In case you need more feedback, the shadow code fix seems to have cleared up my (chronic) live migration problems under 2.0-testing too.
Thanks for your attention to this matter.

-J

Andres Lagar Cavilla wrote:
> Hi Ian,
> I got a fresh code image this morning. Live migration works fine, even
> after un-tweaking the timer back to its default value. I have tested,
> not necessarily thoroughly, but I haven't run into trouble yet. I guess
> this closes this chapter.
> In case you need more feedback, the shadow code fix seems to
> have cleared up my (chronic) live migration problems under
> 2.0-testing too.
> Thanks for your attention to this matter.

Thanks for the feedback. It was an evil little typo that took many hours to hunt down.

Ian