Hi,

I am new to Xen. I am not able to save a newly created domain (Xen 2.0.5). The following is the log from /var/log/xend-debug.log:

    Xfrd>connectionLost> [Failure instance: Traceback: twisted.internet.error.ConnectionLost, Connection to the other side was lost in a non-clean fashion. ]
    XfrdSaveInfo>connectionLost> [Failure instance: Traceback: twisted.internet.error.ConnectionLost, Connection to the other side was lost in a non-clean fashion. ]
    XfrdInfo>connectionLost> [Failure instance: Traceback: twisted.internet.error.ConnectionLost, Connection to the other side was lost in a non-clean fashion. ]
    Error> save failed
    Error> calling errback
    ***cbremove> [Failure instance: Traceback: xen.xend.XendError.XendError, save failed ]
    ***_delete_session> 3
    clientConnectionLost> connector= <twisted.internet.tcp.Connector instance at 0xb78bc42c> reason= [Failure instance: Traceback: twisted.internet.error.ConnectionLost, Connection to the other side was lost in a non-clean fashion.

I am using Debian sarge as my base system. Following previous threads on this list, I have checked the following:

    libcurl.so.4 is installed.
    libcrypto.so.4 is linked to libcrypto.so.0.9.7.
    libssl.so.4 is linked to libssl.so.0.9.7.

I have also tried "xend restart" and get the same message after saving. When I type "xfrd" on the command line, it hangs (nothing happens, and I have to press Control-C to get back to the prompt).

Since save is not working correctly, migration fails in the same way. In reference to a previous similar thread:

http://lists.xensource.com/archives/html/xen-devel/2004-10/msg00524.html

"xm migrate 4 test" results in the following error message:

    Error: Error: [Failure instance: Traceback: xen.xend.XendError.XendError, migrate failed

PLEASE HELP

-john

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
> I am new to Xen working. I am not able to save newly created
> domain ( Xen 2.0.5) . Following is log of /var/log/xend-debug.log:

Please upgrade.

Ian
Hi,

After upgrading Xen to 2.0.7, I am able to save my virtual machines, but restoring a virtual machine gives me an error. Here is my sequence of events:

    xm save 10 test10
    xm restore test10
    Error: errors: transfer daemon (xfrd) error: 22

"xm list" does show the newly restored domain:

    Name       Id  Mem(MB)  CPU  State  Time(s)  Console
    Domain-0    0       92    0  r----     58.5
    ttylinux   11      256    0  -b---      0.8     9610

"xm list -l" gives the following output (no console information). I am not able to connect to the restored virtual machine.

    (domain
        (id 0)
        (name Domain-0)
        (memory 92)
        (maxmem -1)
        (state r----)
        (cpu 0)
        (cpu_time 65.475292894)
        (devices)
    )
    (domain
        (id 11)
        (name Domain-11)
        (memory 255)
        (maxmem 256)
        (state -b---)
        (cpu 0)
        (cpu_time 0.008907453)
        (devices)

The xfrd log for the restore is as follows (tail -f /var/log/xfrd.log):

    Reloading memory pages: 6% 12% 17% 23% 28% 34% 39% 45% 50% 56% 62% 67% 73% 78% 84% 89% 95%
    95%[1126619778.790150] Received all pages
    Received all pages
    100%
    100%
    [1126619778.808769] Memory reloaded.
    Memory reloaded.
    Decreased reservation by 8 pages
    [1126619778.810456] Domain ready to be built.
    Domain ready to be built.
    [1126619778.810637] Domain ready to be unpaused
    Domain ready to be unpaused
    [1126619778.811179] DOM=11
    DOM=11
    2437 [ERR] XFRD> Error adding op field.
    2437 [INF] XFRD> Xfr service err=-22

PLEASE help.

Regards,
John

On 9/13/05, Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> wrote:
> > I am new to Xen working. I am not able to save newly created
> > domain ( Xen 2.0.5) . Following is log of /var/log/xend-debug.log:
>
> Please upgrade.
> Ian
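The "error: 22" reported by xm and the "err=-22" in xfrd.log look like a negated errno value. Assuming xfrd follows the usual Unix convention here (an assumption; nothing in this thread confirms it), the number can be decoded with Python's standard errno module. On Linux, 22 is EINVAL ("Invalid argument"). A minimal sketch; the helper name decode_err is hypothetical:

```python
import errno
import os


def decode_err(code):
    """Return (symbolic name, message) for an errno-style code.

    Accepts either sign, since xfrd.log prints err=-22 while xm prints 22.
    Assumption: the value is a standard errno; unknown codes give 'UNKNOWN'.
    """
    value = abs(int(code))
    name = errno.errorcode.get(value, "UNKNOWN")
    return name, os.strerror(value)


if __name__ == "__main__":
    print(decode_err(-22))
```

This only names the error class. It does not say which argument xfrd rejected, so the question about the domain configuration and kernel still has to be answered.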
The xfrd.log might give you more info.

Alternatively you could try 2.0-testing. This has a couple of extra minor fixes and has been battle tested with many thousands of migrations under load.

Ian

> -----Original Message-----
> From: john bryant [mailto:bryant.johan@gmail.com]
> Sent: 13 September 2005 21:57
> To: Ian Pratt
> Cc: xen-users@lists.xensource.com; ian.pratt@cl.cam.ac.uk
> Subject: Re: [Xen-users] xen save error
>
> Upgrading xen to 2.0.7, I am able to save my virtual machines
> . But restoring virtual machine give me error.
> [...]
I can get xen-2.0-testing to fail on live migrations with virtually no load about 10% of the time :-) Seems it becomes unable to suspend the user domain kernel - the kernel gets the message, but never gets a chance to process it. I'm not saying 2.0-testing won't resolve the problem John is seeing, but I'm not sure I would quite make the statement that it has been 'battle tested' :-)

-- Ray

-----Original Message-----
From: xen-users-bounces@lists.xensource.com [mailto:xen-users-bounces@lists.xensource.com] On Behalf Of Ian Pratt
Sent: Wednesday, September 14, 2005 7:04 AM
To: bryant.johan@gmail.com
Cc: ian.pratt@cl.cam.ac.uk; xen-users@lists.xensource.com
Subject: RE: [Xen-users] xen save error

The xfrd.log might give you more info. Alternatively you could try 2.0-testing. This has a couple of extra minor fixes and has been battle tested with many thousands of migrations under load.

Ian
>I can get xen-2.0-testing to fail on live migrations with virtually no
>load about 10% of the time with live migration :-) Seems it becomes
>unable to suspend the user domain kernel - kernel gets the message, but
>never gets a chance to process it. I'm not saying 2.0-testing won't
>resolve the problem John is seeing, but I'm not sure I would quite make
>the statement that it has been 'battle tested' :-)

So as the one who did the 'battle testing' I should speak up :^)

I did test tens of thousands of save/restore/migrations, but on 2.0.7-patch, i.e. on a version of the 2.0.7 tree with the save/restore fixes applied. This was not checked in directly to 2.0.7 since we aim to filter stuff through 2.0-testing and make a release that way.

So it's extremely interesting to us if you have observed errors in 2.0-testing, since this might mean that something else is to blame (e.g. the upgrade from 2.6.11 to 2.6.12). This will delay our release of 2.0.8, which we were hoping to do RSN.

Can you possibly post some more details about what you observe? And what is in xend.log / xend-debug.log / xfrd.log? Is there a relatively small test case that allows you to reproduce?

thanks,

S.

(p.s. as to the OP, I'm pretty sure the error he was seeing is fixed in both 2.0.7 and 2.0-testing)
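For anyone answering the request for log contents, the three logs can be gathered into one report before posting. A rough sketch only: /var/log/xend-debug.log and /var/log/xfrd.log are the paths used elsewhere in this thread, while /var/log/xend.log is an assumed location, and the function names tail and collect are hypothetical.

```python
import os

# xend-debug.log and xfrd.log paths appear in this thread;
# /var/log/xend.log is an assumed location for xend.log.
LOG_FILES = [
    "/var/log/xend.log",
    "/var/log/xend-debug.log",
    "/var/log/xfrd.log",
]


def tail(path, count=50):
    """Return the last `count` lines of a text file, or [] if it is missing."""
    if not os.path.exists(path):
        return []
    with open(path, "r", errors="replace") as handle:
        return handle.read().splitlines()[-count:]


def collect(paths=LOG_FILES, count=50):
    """Build one plain-text report with a header line per log file."""
    parts = []
    for path in paths:
        parts.append("==== %s ====" % path)
        parts.extend(tail(path, count) or ["(missing or empty)"])
    return "\n".join(parts)


if __name__ == "__main__":
    print(collect())
```

Running it right after a failed save, restore or migrate keeps the relevant entries within the last 50 lines of each file.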
Hi,

I am able to save my virtual machines in Xen 2.0.7, but restoring a virtual machine gives me an error. Here is my sequence of events:

    xm save 10 test10
    xm restore test10
    Error: errors: transfer daemon (xfrd) error: 22

"xm list" does show the newly restored domain:

    Name       Id  Mem(MB)  CPU  State  Time(s)  Console
    Domain-0    0       92    0  r----     58.5
    ttylinux   11      256    0  -b---      0.8     9610

"xm list -l" gives the following output (no console information). I am not able to connect to the restored virtual machine.

    (domain
        (id 0)
        (name Domain-0)
        (memory 92)
        (maxmem -1)
        (state r----)
        (cpu 0)
        (cpu_time 65.475292894)
        (devices)
    )
    (domain
        (id 11)
        (name Domain-11)
        (memory 255)
        (maxmem 256)
        (state -b---)
        (cpu 0)
        (cpu_time 0.008907453)
        (devices)

The xfrd log for the restore is as follows (tail -f /var/log/xfrd.log):

    Reloading memory pages: 6% 12% 17% 23% 28% 34% 39% 45% 50% 56% 62% 67% 73% 78% 84% 89% 95%
    95%[1126619778.790150] Received all pages
    Received all pages
    100%
    100%
    [1126619778.808769] Memory reloaded.
    Memory reloaded.
    Decreased reservation by 8 pages
    [1126619778.810456] Domain ready to be built.
    Domain ready to be built.
    [1126619778.810637] Domain ready to be unpaused
    Domain ready to be unpaused
    [1126619778.811179] DOM=11
    DOM=11
    2437 [ERR] XFRD> Error adding op field.
    2437 [INF] XFRD> Xfr service err=-22

-John

PS: The answer given to every problem is to upgrade from Xen 2.0.5 to Xen 2.0.7, and further upgrading may also solve this one. I would ask the developers to please explain the root cause of the problem and the fix that was applied. That would benefit Xen development as well as all new users. The documentation also needs to be updated substantially.

On 9/14/05, Steven Hand <Steven.Hand@cl.cam.ac.uk> wrote:
> Can you possibly post some more details about what you observe? And what
> is in xend.log / xend-debug.log / xfrd.log? Is there a relatively small
> test case that allows you to reproduce?
>
> thanks,
>
> S.
>
> (p.s. as to the OP, I'm pretty sure the error he was seeing is fixed in
> both 2.0.7 and 2.0-testing)
>I am able to save my virtual machines in Xen 2.0.7. But restoring virtual
>machine give me error
>Here is my sequence of event :
>
>xm save 10 test10
>xm restore test10
>
>Error: errors: transfer daemon (xfrd) error: 22
>
>xm list does show newly restored domain
>
>Name Id Mem(MB) CPU State Time(s) Console
>Domain-0 0 92 0 r---- 58.5
>ttylinux 11 256 0 -b--- 0.8 9610

What's your config for the domain in question? In particular, what is the kernel that you are using?

>DOM=11
>2437 [ERR] XFRD> Error adding op field.
>2437 [INF] XFRD> Xfr service err=-22

I've never seen this before, so I wonder if it's related to the kernel you are using...

cheers,

S.
There is a thread on this list called "Live migration problem" that has a bunch of information. What I was able to narrow it down to, via printk's in the kernel, is that the kernel receives a message indicating it needs to suspend and adds it to the work queue. However, the kernel never executes the work to suspend, and Xen gives up waiting for it to happen. The last thing I see is the kernel calling scheduled_work.

The xfrd.log indicated it was unable to suspend. There was nothing more really in those logs. I'll try to find some time today to re-run the tests. I recently put my 3 machines back to the base 2.0.7 (no 'testing'), so I'll need to put two of them back to 2.0.7 testing. FWIW, this was using the 2.6.11.12 kernel for the user domain with everything else being 2.0.7 testing.

http://lists.xensource.com/archives/html/xen-users/2005-08/msg00626.html will get you to the thread.

-- Ray
I downloaded the latest 2.0 testing install image. I ran two different tests between the two machines:

1) xen-2.0-testing.gz hypervisor, 2.6.12-xenU domain 0, 2.4.30-xen0 user kernel. I migrated around 15 times no problem.

2) xen-2.0-testing.gz hypervisor, 2.6.12-xenU domain 0, 2.6.12-xen0 user kernel. Migrated back and forth about 4 or 5 times and got the failure.

xfrd.log on the originating host:

    [1126799373.757442] Saving memory pages: iter 2 0% 2: sent 473, skipped 15, 2 0% 2: sent 473, skipped 15, delta 171ms, dom0 22%, target 77%, sent 90Mb/s, dirtied 10Mb/s 54 pages
    [1126799373.929002] Saving memory pages: iter 3 0% 3: sent 32, skipped 15, 3 0% 3: sent 32, skipped 15, [DEBUG] Conn_sxpr> (xfr.err 22)[DEBUG] Conn_sxpr< err=0
    Retry suspend domain (0)
    Retry suspend domain (0)
    ...
    Retry suspend domain (0)
    Retry suspend domain (0)
    Retry suspend domain (0)
    Unable to suspend domain. (0)
    Unable to suspend domain. (0)
    Domain appears not to have suspended: 0
    Domain appears not to have suspended: 0
    7913 [WRN] XFRD> Transfer errors:
    7913 [WRN] XFRD> state=XFR_STATE err=1
    7913 [INF] XFRD> Xfr service err=1

xend.log:

    [2005-09-15 10:49:13 xend] DEBUG (blkif:203) Connecting blkif to event channel <BlkifBackendInterface 17 0> ports=14:4
    [2005-09-15 10:49:15 xend] INFO (XendMigrate:323) Migrate BEGIN: ['migrate', ['id', '3'], ['state', 'begin'], ['live', 1], ['resource', 0], ['src', ['host', 'rlinuxas4.bmc.com'], ['domain', '17']], ['dst', ['host', 'raylap']]]
    [2005-09-15 10:49:15 xend] INFO (XendRoot:113) EVENT> xend.domain.migrate ['rayfed4', '17', 'begin', ['migrate', ['id', '3'], ['state', 'begin'], ['live', 1], ['resource', 0], ['src', ['host', 'rlinuxas4.bmc.com'], ['domain', '17']], ['dst', ['host', 'raylap']]]]
    [2005-09-15 10:49:33 xend] DEBUG (XendDomain:490) domain_restart_schedule> 17 suspend 1
    [2005-09-15 10:49:33 xend] INFO (XendRoot:113) EVENT> xend.domain.shutdown ['rayfed4', '17', 'suspend']
    [2005-09-15 10:49:43 xend] ERROR (SrvBase:162) op=migrate: errors: suspend failed, Callback timed out
    Traceback (most recent call last):
      File "/usr/lib/python2.3/site-packages/twisted/internet/defer.py", line 308, in _startRunCallbacks
        self.timeoutCall.cancel()
      File "/usr/lib/python2.3/site-packages/twisted/internet/base.py", line 82, in cancel
        raise error.AlreadyCalled
    AlreadyCalled: Tried to cancel an already-called event.
    [2005-09-15 10:49:45 xend] INFO (XendDomain:571) Destroying domain: name=rayfed4
    [2005-09-15 10:49:45 xend] DEBUG (XendDomainInfo:665) Destroying vifs for domain 17
    [2005-09-15 10:49:45 xend] DEBUG (netif:305) Destroying vif domain=17 vif=0
    [2005-09-15 10:49:46 xend] DEBUG (XendDomainInfo:674) Destroying vbds for domain 17
    [2005-09-15 10:49:46 xend] DEBUG (blkif:552) Destroying blkif domain=17
    [2005-09-15 10:49:46 xend] DEBUG (blkif:408) Destroying vbd domain=17 idx=0
    [2005-09-15 10:49:46 xend] DEBUG (blkif:408) Destroying vbd domain=17 idx=1
    [2005-09-15 10:49:46 xend] DEBUG (XendDomainInfo:634) Closing console, domain 17
    [2005-09-15 10:49:46 xend] DEBUG (XendDomainInfo:622) Closing channel to domain 17
    [2005-09-15 10:49:46 xend] INFO (XendMigrate:349) Migrate ERROR: ['migrate', ['id', '3'], ['state', 'error'], ['live', 1], ['resource', 0], ['src', ['host', 'rlinuxas4.bmc.com'], ['domain', '17']], ['dst', ['host', 'raylap']]]
    [2005-09-15 10:49:46 xend] INFO (XendRoot:113) EVENT> xend.domain.destroy ['rayfed4', '17']
    [2005-09-15 10:49:46 xend] INFO (XendRoot:113) EVENT> xend.domain.migrate ['rayfed4', '17', 'error', ['migrate', ['id', '3'], ['state', 'error'], ['live', 1], ['resource', 0], ['src', ['host', 'rlinuxas4.bmc.com'], ['domain', '17']], ['dst', ['host', 'raylap']]]]
    [2005-09-15 10:49:46 xend] DEBUG (blkif:363) Unbinding vbd (type file) from /dev/loop0
    [2005-09-15 10:49:46 xend] DEBUG (blkif:363) Unbinding vbd (type file) from /dev/loop1
    [2005-09-15 10:49:46 xend] INFO (XendRoot:113) EVENT> xend.domain.died ['rayfed4', '17']
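When the migration is repeated many times, it helps to reduce each xfrd.log to the few facts that matter in the excerpt above: how many suspend retries were logged, whether the suspend was given up, and which service error codes were reported. A small sketch that relies only on the message strings visible in this thread; the function name summarize_xfrd is hypothetical, and other xfrd versions may word these messages differently.

```python
import re


def summarize_xfrd(log_text):
    """Count suspend retries and collect error codes from xfrd.log text."""
    retries = len(re.findall(r"Retry suspend domain", log_text))
    suspend_failed = "Unable to suspend domain" in log_text
    # Matches lines such as "7913 [INF] XFRD> Xfr service err=1".
    codes = [int(c) for c in re.findall(r"Xfr service err=(-?\d+)", log_text)]
    return {
        "retries": retries,
        "suspend_failed": suspend_failed,
        "service_errors": codes,
    }


if __name__ == "__main__":
    with open("/var/log/xfrd.log", "r", errors="replace") as handle:
        print(summarize_xfrd(handle.read()))
```

The same function also picks up the "Xfr service err=-22" line from the restore failure reported earlier in the thread, which makes it easy to tell the two problems apart.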
>There is a thread in here called "Live migration problem" that has a
>bunch of information. What I was able to narrow it down, via printk's
>in the kernel, is the kernel receives a message indicating it needs to
>suspend and adds to the work queue. However the kernel never executes
>the work to suspend and xen gives up waiting for it to happen. The last
>thing I see is the kernel calls scheduled_work
>
>The xfrd.log indicated it was unable to suspend. There was nothing more
>really in those logs. I'll try to find some time today to re-run the
>tests again. I recently put my 3 machines back to the base 2.0.7 (no
>'testing') so I'll need to put two of them back to 2.0.7 testing. FWIW,
>this was using the 2.6.11.12 kernel for the user domain with everything
>else being 2.0.7 testing.
>
>http://lists.xensource.com/archives/html/xen-users/2005-08/msg00626.html
>will get you to the thread.

yes - though this doesn't seem to be the same error; in particular the "[ERR] XFRD> Error adding op field" is the bit I was referring to never having seen before...

cheers,

S.
Right, I don't think it is at all related to the xen save error. I was just reporting a separate issue with live migration after the comment had been made that migration had gone through combat :-)

-- Ray

-----Original Message-----
From: Steven Hand [mailto:Steven.Hand@cl.cam.ac.uk]
Sent: Thursday, September 15, 2005 11:05 AM
To: Cole, Ray
Cc: Steven Hand; Ian Pratt; bryant.johan@gmail.com; xen-users@lists.xensource.com; Steven.Hand@cl.cam.ac.uk
Subject: Re: [Xen-users] xen save error

yes - though this doesn't seem to be the same error; in particular the "[ERR] XFRD> Error adding op field" is the bit I was referring to never having seen before...

cheers,

S.
> I can get xen-2.0-testing to fail on live migrations with
> virtually no load about 10% of the time with live migration
> :-) Seems it becomes unable to suspend the user domain
> kernel - kernel gets the message, but never gets a chance to
> process it. I'm not saying 2.0-testing won't resolve the
> problem John is seeing, but I'm not sure I would quite make
> the statement that it has been 'battle tested' :-)

Can you say more about your configuration? I haven't heard of migrate problems on 2.0-testing. Almost all the development effort is focussed on 3.0, but if it's a reproducible problem someone might take a look.

Migration on 2.0-testing has been tested pretty thoroughly, so it must be something to do with your configuration or other xm operations you've done on the domain since you started it.

Ian
Sure. I've got 2 machines where I've installed 2.0-testing from xen-2.0-testing-install.tar.gz. I downloaded it this morning to make sure I had the latest. Ran install.sh. I made sure grub is pointing to xen-2.0.gz, which is in turn a symbolic link to xen-2.0-testing.gz. I did a depmod for 2.6.12-xen0 and xenU and created initrd's for both. I also modified grub to use the 2.6.12-xen0 kernel with it and rebooted. uname -a confirmed I'm using 2.6.12-xen0. Domain 0 was originally a RedHat AS 4.0 installation ('minimal installation' selected).

I then copied a Fedora Core 4 installation image to an NFS mount location. I also created a swap file (have tried with and without) on the NFS link. I created the .cfg file for the domain - nothing special about it. /Domains/t is where the NFS mount is made. The .cfg is later in this email.

The FC4 image has had /lib/tls renamed to tls.disabled, although I still get a warning when booting the user domain that I've got /lib/tls. I don't know, maybe the initrd has it.

Anyway... I started xend/xfrd. I start up the FC4 domain using 2.6.12-xenU. This domain uses autofs extensively (all /home entries are automounted) and NIS. I log in to the user domain using a remote xterm (ssh into the domain, start xterm). I then start 'top' so I can see that the domain is still alive. I then do:

    xm migrate --live rayfed4 {new_machine}

back and forth between the two machines that have identical Xen 2.0 Testing installations. I can generally go back and forth about 4 or 5 times before one of the migrate commands tells me it had an error (can't suspend). I had at one time put printk's into the user kernel (after downloading the 2.0 testing source, of course..) and confirmed that the kernel receives the message to suspend, but the suspend work the kernel schedules never gets executed. I wait about 10 seconds between migration attempts.

Below is my .cfg:

# -*- mode: python; -*-
#===========================================================================
# Python configuration setup for 'xm create'.
# This script sets the parameters used when a domain is created using 'xm create'.
# You use a separate script for each domain you want to create, or
# you can set the parameters for the domain on the xm command line.
#===========================================================================

#----------------------------------------------------------------------------
# Kernel image file.
kernel = "/boot/vmlinuz-2.6.12-xenU"

# Optional ramdisk.
ramdisk = "/boot/initrd-2.6.12-xenU.img"

# The domain build function. Default is 'linux'.
#builder='linux'

# Initial memory allocation (in megabytes) for the new domain.
memory = 192

# A name for your domain. All domains must have different names.
name = "rayfed4"

# Which CPU to start domain on?
#cpu = -1   # leave to Xen to pick

#----------------------------------------------------------------------------
# Define network interfaces.

# Number of network interfaces. Default is 1.
#nics=1

# Optionally define mac and/or bridge for the network interfaces.
# Random MACs are assigned if not given.
#vif = [ 'mac=aa:00:00:00:00:11, bridge=xen-br0' ]
vif = [ 'mac=52:54:00:12:34:56' ]

#----------------------------------------------------------------------------
# Define the disk devices you want the domain to have access to, and
# what you want them accessible as.
# Each disk entry is of the form phy:UNAME,DEV,MODE
# where UNAME is the device, DEV is the device name the domain will see,
# and MODE is r for read-only, w for read-write.

#disk = [ 'file:/dev/md2,md2,w' ]
#disk = [ 'file:/dev/md3,sda1,w', 'file:/dev/md4,sda2,w' ]
disk = [ 'file:/Domains/t/Fed4.img,sda1,w', 'file:/Domains/t/Fed4Swap.img,sda2,w' ]

# Set root device.
root = "/dev/sda1 ro"
#nfs_root = '/full/path/to/root/directory'

# Sets runlevel 3.
extra = "3"

#----------------------------------------------------------------------------
# Set according to whether you want the domain restarted when it exits.
# The default is 'onreboot', which restarts the domain when it shuts down
# with exit code reboot.
# Other values are 'always', and 'never'.

#restart = 'onreboot'

#===========================================================================

-----Original Message-----
From: Ian Pratt [mailto:m+Ian.Pratt@cl.cam.ac.uk]
Sent: Thursday, September 15, 2005 1:04 PM
To: Cole, Ray; bryant.johan@gmail.com
Cc: xen-users@lists.xensource.com; ian.pratt@cl.cam.ac.uk
Subject: RE: [Xen-users] xen save error

Can you say more about your configuration? I haven't heard of migrate problems on 2.0-testing. Almost all the development effort is focussed on 3.0, but if it's a reproducible problem someone might take a look.

Migration on 2.0-testing has been tested pretty thoroughly, so it must be something to do with your configuration or other xm operations you've done on the domain since you started it.

Ian
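The reproduction procedure described in this message (migrate the domain back and forth, waiting about 10 seconds between attempts, until one migrate fails) could be scripted so that it counts as the small test case asked for earlier in the thread. The following is only a sketch under stated assumptions: the xm migrate --live command is the one quoted in the message, but running it over passwordless ssh on whichever host currently holds the domain is an assumption, and the names run_remote and ping_pong are hypothetical.

```python
import subprocess
import time


def run_remote(host, command):
    """Run `command` on `host` via ssh and return True on exit status 0.

    Assumption: passwordless ssh to both Xen hosts is already set up.
    """
    return subprocess.call(["ssh", host] + command) == 0


def ping_pong(domain, host_a, host_b, rounds=20, pause=10,
              run=run_remote, sleep=time.sleep):
    """Migrate `domain` back and forth between two hosts.

    Returns the number of successful migrations before the first
    failure, or `rounds` if every migration succeeded.
    """
    source, target = host_a, host_b
    for done in range(rounds):
        # Same command as in the message: xm migrate --live <domain> <host>
        if not run(source, ["xm", "migrate", "--live", domain, target]):
            return done
        source, target = target, source
        sleep(pause)
    return rounds
```

With the host names that appear in the xend.log excerpt, a run would look like ping_pong("rayfed4", "rlinuxas4.bmc.com", "raylap"). The run and sleep parameters exist only so the loop can be exercised without real hosts.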
BTW, back when I was using printk''s in the user kernel to determine if it was getting the message to suspend or not, I found it really odd that I could remove the line in the kernel that responds to the receipt of the suspend command (keeping the one that says "I have suspended") and it actually works better - usually going 12 to 20 migrations between failures. I was somewhat surprised that removing the response that the message was receive would work. I would have figured something would have been waiting on the receipt of that message. This makes me wonder if there is some timing issue going on where the kernel is told to suspend and then the domain stops getting CPU time before it is able to complete suspending. -- Ray -----Original Message----- From: Cole, Ray Sent: Thursday, September 15, 2005 1:29 PM To: ''Ian Pratt''; bryant.johan@gmail.com Cc: xen-users@lists.xensource.com; ian.pratt@cl.cam.ac.uk Subject: RE: [Xen-users] xen save error Sure. I''ve got 2 machines where I''ve installed 2.0-testing from xen-2.0-testing-install.tar.gz. I downloaded it this morning to make sure I had the latest. Ran install.sh. I made sure grub is pointing to xen-2.0.gz, which is in turn a symbolic link to xen-2.0-testing.gz. I did a depmod for 2.6.12-xen0 and xenU and created initrd''s for both. Also modified grub to use the 2.6.12-xen0 kernel with it and rebooted. uname -a confirmed I''m using 2.6.12-xen0. Domain 0 was orignally a RedHat AS 4.0 installation (''minimal installation'' selected). I then copied a Fedora Core 4 installation image to an NFS mount location. Also created a swap file (have tried with and without) on the NFS link. I created the .cfg file for the domain - nothing special about it. /Domains/t is where the NFS mount is made. The .cfg is later in this email. The FC4 image has had /lib/tls renamed to tls.disabled, although I still get a warning when booting the user domain that I''ve got /lib/tls. I don''t know, maybe the initrd has it. 
Anyway... from there I started xend/xfrd. I start up the FC4 domain using 2.6.12-xenU. This domain uses autofs extensively (all /home entries are automounted) and NIS. I log in to the user domain using a remote xterm (ssh into the domain, start xterm). I then start 'top' so I can see that the domain is still alive.

I then do:

    xm migrate --live rayfed4 {new_machine}

back and forth between the two machines that have identical Xen 2.0 Testing installations. I can generally go back and forth about 4 or 5 times before one of the migrate commands tells me it had an error (can't suspend). I had at one time put printk's into the user kernel (after downloading the 2.0 testing source, of course..) and confirmed that the kernel receives the message to suspend, but the suspend work the kernel schedules never gets executed. I wait about 10 seconds between migration attempts.

Below is my .cfg:

# -*- mode: python; -*-
#===========================================================================
# Python configuration setup for 'xm create'.
# This script sets the parameters used when a domain is created using 'xm create'.
# You use a separate script for each domain you want to create, or
# you can set the parameters for the domain on the xm command line.
#===========================================================================

#----------------------------------------------------------------------------
# Kernel image file.
kernel = "/boot/vmlinuz-2.6.12-xenU"

# Optional ramdisk.
ramdisk = "/boot/initrd-2.6.12-xenU.img"

# The domain build function. Default is 'linux'.
#builder='linux'

# Initial memory allocation (in megabytes) for the new domain.
memory = 192

# A name for your domain. All domains must have different names.
name = "rayfed4"

# Which CPU to start domain on?
#cpu = -1 # leave to Xen to pick

#----------------------------------------------------------------------------
# Define network interfaces.

# Number of network interfaces. Default is 1.
#nics=1

# Optionally define mac and/or bridge for the network interfaces.
# Random MACs are assigned if not given.
#vif = [ 'mac=aa:00:00:00:00:11, bridge=xen-br0' ]
vif = [ 'mac=52:54:00:12:34:56' ]

#----------------------------------------------------------------------------
# Define the disk devices you want the domain to have access to, and
# what you want them accessible as.
# Each disk entry is of the form phy:UNAME,DEV,MODE
# where UNAME is the device, DEV is the device name the domain will see,
# and MODE is r for read-only, w for read-write.

#disk = [ 'file:/dev/md2,md2,w' ]
#disk = [ 'file:/dev/md3,sda1,w', 'file:/dev/md4,sda2,w' ]
disk = [ 'file:/Domains/t/Fed4.img,sda1,w', 'file:/Domains/t/Fed4Swap.img,sda2,w' ]

# Set root device.
root = "/dev/sda1 ro"
#nfs_root = '/full/path/to/root/directory'

# Sets runlevel 3.
extra = "3"

#----------------------------------------------------------------------------
# Set according to whether you want the domain restarted when it exits.
# The default is 'onreboot', which restarts the domain when it shuts down
# with exit code reboot.
# Other values are 'always', and 'never'.

#restart = 'onreboot'

#===========================================================================

-----Original Message-----
From: Ian Pratt [mailto:m+Ian.Pratt@cl.cam.ac.uk]
Sent: Thursday, September 15, 2005 1:04 PM
To: Cole, Ray; bryant.johan@gmail.com
Cc: xen-users@lists.xensource.com; ian.pratt@cl.cam.ac.uk
Subject: RE: [Xen-users] xen save error

> I can get xen-2.0-testing to fail on live migrations with
> virtually no load about 10% of the time with live migration
> :-) Seems it becomes unable to suspend the user domain
> kernel - kernel gets the message, but never gets a chance to
> process it. I'm not saying 2.0-testing won't resolve the
> problem John is seeing, but I'm not sure I would quite make
> the statement that it has been 'battle tested' :-)

Can you say more about your configuration?
I haven't heard of migrate problems on 2.0-testing. Almost all the development effort is focussed on 3.0, but if it's a reproducible problem someone might take a look.

Migration on 2.0-testing has been tested pretty thoroughly, so it must be something to do with your configuration or other xm operations you've done on the domain since you started it.

Ian
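Since the question is about configuration, it may help to pull the disk lines of the posted .cfg apart. The sketch below splits one entry of the form backend:UNAME,DEV,MODE into its parts. It is only an illustration written for this thread, not Xen's own parser; the names `DiskSpec` and `parse_disk` are made up here.

```python
from collections import namedtuple

# Hypothetical helper type for this example; not part of Xen.
DiskSpec = namedtuple("DiskSpec", "backend uname dev mode")


def parse_disk(entry):
    """Split 'backend:UNAME,DEV,MODE' into its four parts."""
    # Exactly three comma-separated fields are expected.
    target, dev, mode = [part.strip() for part in entry.split(",")]
    # The backend type ('phy', 'file', ...) is everything before the first colon.
    backend, sep, uname = target.partition(":")
    if not sep or mode not in ("r", "w"):
        raise ValueError("malformed disk entry: %r" % entry)
    return DiskSpec(backend, uname, dev, mode)


if __name__ == "__main__":
    for entry in ["file:/Domains/t/Fed4.img,sda1,w",
                  "file:/Domains/t/Fed4Swap.img,sda2,w"]:
        print(parse_disk(entry))
```

For the config above this shows both the root image and the swap image are file-backed and sit under /Domains/t, the NFS mount.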
>I've got 2 machines where I've installed 2.0-testing from
>xen-2.0-testing-install.tar.gz. I downloaded it this morning to make
>sure I had the latest. Ran install.sh.
>
>I made sure grub is pointing to xen-2.0.gz, which is in turn a symbolic
>link to xen-2.0-testing.gz. I did a depmod for 2.6.12-xen0 and xenU and
>created initrds for both. Also modified grub to use the 2.6.12-xen0
>kernel with it and rebooted. uname -a confirmed I'm using 2.6.12-xen0.
>Domain 0 was originally a RedHat AS 4.0 installation ('minimal
>installation' selected).
>
>I then copied a Fedora Core 4 installation image to an NFS mount
>location. Also created a swap file (have tried with and without) on the
>NFS link. I created the .cfg file for the domain - nothing special about
>it. /Domains/t is where the NFS mount is made. The .cfg is later in
>this email.
>
>The FC4 image has had /lib/tls renamed to tls.disabled, although I still
>get a warning when booting the user domain that I've got /lib/tls. I
>don't know, maybe the initrd has it.
>
>Anyway... from there I started xend/xfrd. I start up the FC4 domain using
>2.6.12-xenU. This domain uses autofs extensively (all /home entries are
>automounted) and NIS. I log in to the user domain using a remote xterm
>(ssh into the domain, start xterm). I then start 'top' so I can see
>that the domain is still alive.
>
>I then do:
>
> xm migrate --live rayfed4 {new_machine}
>
>back and forth between the two machines that have identical Xen 2.0
>Testing installations. I can generally go back and forth about 4 or 5
>times before one of the migrate commands tells me it had an error (can't
>suspend). I had at one time put printk's into the user kernel (after
>downloading the 2.0 testing source, of course..) and confirmed that the
>kernel receives the message to suspend, but the suspend work the kernel
>schedules never gets executed. I wait about 10 seconds between
>migration attempts.

Ok - thanks for the info.
I've just installed a fresh 2.0-testing (built from source) on a test box and run about 35 (and counting) live migrations to localhost with no observed errors (I am also ssh'd into the guest and running top).

My root fs is CentOS 4.1 for both dom0 and the guest, although I'm using a local LVM volume for the guest rather than a loopback file over NFS... hmm...

I'll try to get another machine grooved so I can test it across the network....

cheers,

S.
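For anyone wanting to repeat the back-and-forth test without typing each command, here is a rough Python sketch. It assumes passwordless ssh to both hosts and that `xm migrate` exits non-zero when the migration fails; neither is confirmed in this thread. The host names and the helper names `migrate_command` and `ping_pong` are hypothetical.

```python
import subprocess
import time


def migrate_command(domain, source, target):
    """Build the ssh command asking `source` to live-migrate `domain` to `target`."""
    # Same argument order as in the thread: xm migrate --live <domain> <host>
    return ["ssh", source, "xm", "migrate", "--live", domain, target]


def ping_pong(domain, host_a, host_b, rounds, pause=10.0,
              run=subprocess.call, sleep=time.sleep):
    """Migrate `domain` back and forth between two hosts.

    Returns the number of successful migrations before the first failure,
    or `rounds` if none failed.
    """
    source, target = host_a, host_b
    for done in range(rounds):
        # Assumption: a non-zero exit status means the migration failed.
        if run(migrate_command(domain, source, target)) != 0:
            return done
        # The domain now lives on the other host, so swap the roles.
        source, target = target, source
        # The thread mentions waiting about 10 seconds between attempts.
        sleep(pause)
    return rounds


if __name__ == "__main__":
    # Hypothetical host names; replace with the two Xen machines.
    ok = ping_pong("rayfed4", "xenhost1", "xenhost2", rounds=50)
    print("successful migrations before first failure: %d" % ok)
```

The `run` and `sleep` parameters exist only so the loop can be exercised without real hosts.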
If there is any additional probing around you want me to do let me know. I'm willing to help get more information.

-- Ray

-----Original Message-----
From: Steven Hand [mailto:Steven.Hand@cl.cam.ac.uk]
Sent: Thursday, September 15, 2005 2:09 PM
To: Cole, Ray
Cc: Ian Pratt; bryant.johan@gmail.com; ian.pratt@cl.cam.ac.uk; xen-users@lists.xensource.com; Steven.Hand@cl.cam.ac.uk
Subject: Re: [Xen-users] xen save error

>I've got 2 machines where I've installed 2.0-testing from
>xen-2.0-testing-install.tar.gz. I downloaded it this morning to make
>sure I had the latest. Ran install.sh.
>
>I made sure grub is pointing to xen-2.0.gz, which is in turn a symbolic
>link to xen-2.0-testing.gz. I did a depmod for 2.6.12-xen0 and xenU and
>created initrds for both. Also modified grub to use the 2.6.12-xen0
>kernel with it and rebooted. uname -a confirmed I'm using 2.6.12-xen0.
>Domain 0 was originally a RedHat AS 4.0 installation ('minimal
>installation' selected).
>
>I then copied a Fedora Core 4 installation image to an NFS mount
>location. Also created a swap file (have tried with and without) on the
>NFS link. I created the .cfg file for the domain - nothing special about
>it. /Domains/t is where the NFS mount is made. The .cfg is later in
>this email.
>
>The FC4 image has had /lib/tls renamed to tls.disabled, although I still
>get a warning when booting the user domain that I've got /lib/tls. I
>don't know, maybe the initrd has it.
>
>Anyway... from there I started xend/xfrd. I start up the FC4 domain using
>2.6.12-xenU. This domain uses autofs extensively (all /home entries are
>automounted) and NIS. I log in to the user domain using a remote xterm
>(ssh into the domain, start xterm). I then start 'top' so I can see
>that the domain is still alive.
>
>I then do:
>
> xm migrate --live rayfed4 {new_machine}
>
>back and forth between the two machines that have identical Xen 2.0
>Testing installations. I can generally go back and forth about 4 or 5
>times before one of the migrate commands tells me it had an error (can't
>suspend). I had at one time put printk's into the user kernel (after
>downloading the 2.0 testing source, of course..) and confirmed that the
>kernel receives the message to suspend, but the suspend work the kernel
>schedules never gets executed. I wait about 10 seconds between
>migration attempts.

Ok - thanks for the info.

I've just installed a fresh 2.0-testing (built from source) on a test box and run about 35 (and counting) live migrations to localhost with no observed errors (I am also ssh'd into the guest and running top).

My root fs is CentOS 4.1 for both dom0 and the guest, although I'm using a local LVM volume for the guest rather than a loopback file over NFS... hmm...

I'll try to get another machine grooved so I can test it across the network....

cheers,

S.
>If there is any additional probing around you want me to do let me know.
> I'm willing to help get more information.

Thanks - one interesting fact would be what workload (if any) is running within your guest?

I've now managed to repro it about 3 times (out of probably a thousand live relocations) so it's definitely rarer for me... but one possible common cause is that the guest was 'quiet' just prior to the hang (I have a workload script that alternates between disk I/O, cpu load, network load + sleep with random timeouts).

As you had identified, the domain is not entering __do_shutdown() at all in the 'hang' case, which points to some relatively low level issue at either ctrl_if or xen level. Have an instrumented system now and will rerun overnight; hopefully some useful data will come out in the morning.

cheers,

S.
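The workload script itself was not posted. As a guess at its general shape, the Python sketch below alternates between load phases and idles for a random time after each one, so the guest is sometimes busy and sometimes 'quiet' when a migration starts. Everything here (`run_workload`, `cpu_load`, `disk_io`, the 5 second maximum idle) is an assumption for illustration; a network phase is left for the caller to supply.

```python
import os
import random
import tempfile
import time


def cpu_load(n=200000):
    """Busy arithmetic loop as a stand-in for CPU load."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def disk_io(size=1 << 20):
    """Write and re-read a scratch file as a stand-in for disk I/O."""
    with tempfile.TemporaryFile() as scratch:
        scratch.write(os.urandom(size))
        scratch.flush()
        scratch.seek(0)
        return len(scratch.read())


def run_workload(phases, iterations, rng=None, sleep=time.sleep, max_idle=5.0):
    """Cycle through (name, action) phases, idling a random time after each."""
    if rng is None:
        rng = random.Random()
    log = []
    for i in range(iterations):
        name, action = phases[i % len(phases)]
        action()
        # Random idle period: these are the 'quiet' windows of the guest.
        idle = rng.uniform(0.0, max_idle)
        sleep(idle)
        log.append((name, idle))
    return log


if __name__ == "__main__":
    phases = [("disk", disk_io), ("cpu", cpu_load)]
    for name, idle in run_workload(phases, iterations=6):
        print("%s phase, then idle %.2fs" % (name, idle))
```

Running this inside the guest while migrating would make it possible to check whether the hangs line up with the idle periods recorded in the log.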
My guest is usually very quiet - running only 'top'. I'm VERY glad you were able to reproduce it. It does seem to be very timing-dependent.

-- Ray

-----Original Message-----
From: Steven Hand [mailto:Steven.Hand@cl.cam.ac.uk]
Sent: Thursday, September 15, 2005 11:31 PM
To: Cole, Ray
Cc: Steven Hand; Ian Pratt; bryant.johan@gmail.com; ian.pratt@cl.cam.ac.uk; xen-users@lists.xensource.com; Steven.Hand@cl.cam.ac.uk
Subject: Re: [Xen-users] xen save error

>If there is any additional probing around you want me to do let me know.
> I'm willing to help get more information.

Thanks - one interesting fact would be what workload (if any) is running within your guest?

I've now managed to repro it about 3 times (out of probably a thousand live relocations) so it's definitely rarer for me... but one possible common cause is that the guest was 'quiet' just prior to the hang (I have a workload script that alternates between disk I/O, cpu load, network load + sleep with random timeouts).

As you had identified, the domain is not entering __do_shutdown() at all in the 'hang' case, which points to some relatively low level issue at either ctrl_if or xen level. Have an instrumented system now and will rerun overnight; hopefully some useful data will come out in the morning.

cheers,

S.
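To put the numbers in this thread side by side, the Python sketch below compares the reported failure rates: roughly one failure in 5 to 10 migrations on the NFS-backed setup against about 3 in 1000 on the LVM-backed one. It treats each migration as an independent trial, which is an assumption and probably a crude one if the problem is timing-dependent. The function names are made up for this example.

```python
def p_clean_run(p_fail, n):
    """Probability that n independent migrations all succeed."""
    return (1.0 - p_fail) ** n


def expected_runs_between_failures(p_fail):
    """Mean number of migrations per failure for a constant failure rate."""
    return 1.0 / p_fail


if __name__ == "__main__":
    # Rates as reported in the thread (approximate).
    rates = [("about 1 in 5", 0.2),
             ("about 10%", 0.1),
             ("about 3 in 1000", 3.0 / 1000)]
    for label, p in rates:
        print("%-16s P(35 clean migrations) = %.4f, about %.0f runs per failure"
              % (label, p_clean_run(p, 35), expected_runs_between_failures(p)))
```

Under that independence assumption, 35 clean migrations in a row would be unlikely at a 10% or 20% failure rate but quite expected at 0.3%, which fits the two setups behaving differently rather than one of them just being lucky.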