Hi,

I am using Xen 2.0.7 on a dual Intel Xeon 2.8GHz system with 4GB of RAM, with 2.6.11 as the kernel for my domain 0. Domain 0 runs Debian Sarge with a backported Xen 2.0.7 package (only small changes to the Debian 2.0.6 package, nothing important enough to mention). All kernels were built from vanilla kernels with the Xen patch applied. The domUs run 2.6.11 or 2.4.30 (Debian, SUSE).

I have no problems within the domains and everything runs very smoothly, except for one thing (which also did not work correctly for me in Xen 2.0.6): I can save a domain once with "xm save <domainname> <suspendfile>" and restore it again, but if I try a second "xm save ..." it simply seems to hang. Nothing happens, and the last entries in the logs are these lines:

==> /var/log/xend.log <==
[2005-08-15 20:12:27 xend] INFO (XendMigrate:380) Save BEGIN: ['save', ['id', '1'], ['state', 'begin'], ['domain', '5'], ['file', '/suspend/vm-ralph']]
[2005-08-15 20:12:27 xend] INFO (XendRoot:113) EVENT> xend.domain.save ['vm-ralph', '5', 'begin', ['save', ['id', '1'], ['state', 'begin'], ['domain', '5'], ['file', '/suspend/vm-ralph']]]

==> /var/log/xfrd.log <==
3808 [INF] XFRD> Accepted connection from 127.0.0.1:3905 on 2
4165 [INF] XFRD> Xfr service for 127.0.0.1:3905
[DEBUG] Conn_init> flags=1
[DEBUG] Conn_init> write stream...
[DEBUG] stream_init>mode=w flags=1 compress=0
[DEBUG] stream_init> unbuffer...
[DEBUG] stream_init< err=0
[DEBUG] Conn_init> read stream...
[DEBUG] stream_init>mode=r flags=1 compress=0
[DEBUG] stream_init> unbuffer...
[DEBUG] stream_init< err=0
[DEBUG] Conn_sxpr> (xfr.hello 1 0)
[DEBUG] Conn_sxpr< err=0
[DEBUG] Conn_sxpr> (xfr.save 5 "(domain (id 5) (name vm-ralph) (memory 127) (maxmem 128) (state -b---) (cpu 3) (cpu_time 1.583158713) (up_time 1401.25794005) (start_time 1124128146.12) (console (status listening) (id 12) (domain 5) (local_port 12) (remote_port 1) (console_port 9605)) (devices (vif (idx 0) (vif 0) (mac aa:00:00:00:00:22) (vifname vif5.0) (ip 212.79.XXX.XXX/32) (evtchn 17 4) (index 0)) (vbd (idx 0) (vdev 2049) (device 65030) (mode w) (dev sda1) (uname phy:xen-volumes/vm-ralph) (node xen-volumes/vm-ralph) (index 0)) (vbd (idx 1) (vdev 2050) (device 65031) (mode w) (dev sda2) (uname phy:xen-volumes/swap-ralph) (node xen-volumes/swap-ralph) (index 1))) (config (vm (name vm-ralph) (memory 128) (cpu 3) (image (linux (kernel /boot/xen-linux-2.6.11-domu-tops1) (ramdisk /boot/xen-linux-2.6.11-domu-tops1-modules) (root '/dev/sda1 ro'))) (device (vbd (uname phy:xen-volumes/vm-ralph) (dev sda1) (mode w))) (device (vbd (uname phy:xen-volumes/swap-ralph) (dev sda2) (mode w))) (device (vif (mac aa:00:00:00:00:22) (ip 212.79.XXX.XXX/32))))))" /suspend/vm-ralph)
[DEBUG] Conn_sxpr< err=0
[1124129547.387983] xc_linux_save start 5
xc_linux_save start 5

I can strace the "xm save" process, but there is not much action:

xen:/var/log# ps fax |grep xm
 4164 pts/0 S+ 0:00 | \_ python /usr/sbin/xm save vm-ralph /suspend/vm-ralph
xen:/var/log# strace -p 4164
Process 4164 attached - interrupt to quit
recv(3,

An xfrd thread does seem to be spawned, but it shows more or less the same picture as the "xm save" process:

xen:/var/log# ps fax |grep xfrd
 3808 ? S 0:00 xfrd
 4165 ? SL 0:00 \_ xfrd
xen:/var/log# strace -p 4165
Process 4165 attached - interrupt to quit
read(3,

I can press Ctrl-C, and "xm save" then aborts with the following error (I waited over 3 minutes):

Traceback (most recent call last):
  File "/usr/sbin/xm", line 9, in ?
    main.main(sys.argv)
  File "/usr/lib/python2.3/site-packages/xen/xm/main.py", line 808, in main
    xm.main(args)
  File "/usr/lib/python2.3/site-packages/xen/xm/main.py", line 106, in main
    self.main_call(args)
  File "/usr/lib/python2.3/site-packages/xen/xm/main.py", line 124, in main_call
    p.main(args[1:])
  File "/usr/lib/python2.3/site-packages/xen/xm/main.py", line 276, in main
    server.xend_domain_save(dom, savefile)
  File "/usr/lib/python2.3/site-packages/xen/xend/XendClient.py", line 244, in xend_domain_save
    {'op' : 'save',
  File "/usr/lib/python2.3/site-packages/xen/xend/XendClient.py", line 148, in xendPost
    return self.client.xendPost(url, data)
  File "/usr/lib/python2.3/site-packages/xen/xend/XendProtocol.py", line 79, in xendPost
    return self.xendRequest(url, "POST", args)
  File "/usr/lib/python2.3/site-packages/xen/xend/XendProtocol.py", line 143, in xendRequest
    resp = conn.getresponse()
  File "/usr/lib/python2.3/httplib.py", line 781, in getresponse
    response.begin()
  File "/usr/lib/python2.3/httplib.py", line 273, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.3/httplib.py", line 231, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.3/socket.py", line 323, in readline
    data = recv(1)
KeyboardInterrupt

After that, it doesn't matter whether I shut down and recreate the domain before trying to save it a second time. It happens every time after the first successful save and restore, and sometimes even on the first "xm save" attempt.

It also seems that Xen leaves the "half-saved" domain in a broken state, because I cannot shut the domain down correctly after the second "xm save" attempt. I can ssh into it and type "halt", and it shuts down, but Xen (xm list) still thinks that the domain is running. Even "xm destroy <domainname>" doesn't help. I have to reboot the physical machine to get the domain working correctly again.

Because this is supposed to become a production system very soon, I would very much appreciate any help.
More information (like xm dmesg) available on request... ;-PP

--Ralph

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
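The report above shows "xm save" blocked in recv() with no upper bound on the wait. As a minimal sketch of how a wrapper script could at least detect such a hang, here is a small helper. It assumes a modern Python 3 interpreter (not the Python 2.3 shipped with Xen 2.0.7), and the name run_with_timeout is a hypothetical helper, not part of xm or xend.

```python
import subprocess


def run_with_timeout(argv, timeout_s):
    """Run a command and classify the outcome as 'ok', 'failed' or 'hung'.

    Hypothetical helper: it only wraps subprocess.run and knows nothing
    about Xen itself.
    """
    try:
        result = subprocess.run(argv, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


# Example call using the command from the report (the 180 s limit is an
# arbitrary assumption, chosen to match the "over 3 minutes" wait):
#   run_with_timeout(["xm", "save", "vm-ralph", "/suspend/vm-ralph"], 180)
```

Note that this only detects the hang. According to the report, aborting "xm save" does not bring the domain back into a usable state.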
>I have no problems within domains and everything is running very smoothly,
>except one thing (which was also not working correctly in xen-2.0.6 for me):
>I can save a domain with "xm save <domainname> <suspendfile>" once and I can
>restore this domain again, but if I try a second "xm save ..." it simply
>seems to hang. Nothing happens and the last thing in the logs are these
>lines:

Is this the same with both 2.4 and 2.6 domUs? I've noticed something similar with 2.0.7, but only with 2.4 domUs... It would be useful to know whether it affects 2.6 as well. I'm trying to track it down.

cheers,

S.
Steven Hand wrote:

>Is this the same with both 2.4 and 2.6 domUs? I've noticed something similar
>with 2.0.7 but only with 2.4 domUs ... it would be useful to know if it
>affects 2.6 also - I'm trying to track it down.

There's a very similar problem in 3.0 that has to do with a race condition in the xc_save/Xend interaction: xc_save thinks it has sent the "suspend" command over the pipe, and Xend is waiting for it to arrive.

xc_save wasn't back-ported to 2.0.7, right? Looking into it right now.

Regards,

Anthony Liguori
On Monday, 15 August 2005 23:29, Anthony Liguori wrote:
> Steven Hand wrote:
> >Is this the same with both 2.4 and 2.6 domUs? I've noticed something
> > similar with 2.0.7 but only with 2.4 domUs ... it would be useful to know
> > if it affects 2.6 also - I'm trying to track it down.

Yes, it's the same with 2.4 and 2.6 domUs...

> There's a very similar problem in 3.0 that has to do with a race
> condition with the xc_save/Xend interaction. xc_save thinks it has sent
> the "suspend" command over the pipe and Xend is waiting for it to arrive.

... but after some more testing I noticed another interesting thing: "xm save" hangs if the suspend file doesn't exist. The first time after a dom0 reboot there is normally no problem, but if I delete the file and try "xm save" again, it fails about 95% of the time.

If I keep the save file and then run "xm save" followed by "xm restore", there seems to be no problem. I ran 10 tests and all of them worked.

regards,
--Ralph
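The comparison described above (deleting the suspend file before each save versus keeping it) could be run as a small harness. This is only a sketch in Python 3: `cycle` is an assumed, caller-supplied function that performs one save followed by one restore and returns True on success, and compare_conditions is a hypothetical name. Nothing here calls xm directly.

```python
import os


def compare_conditions(cycle, save_path, runs):
    """Count failed save/restore cycles with and without a pre-existing file.

    `cycle(save_path)` is a caller-supplied (hypothetical) function that
    runs one save followed by one restore and returns True on success.
    """
    failures = {"file_deleted": 0, "file_kept": 0}
    for condition in failures:
        for _ in range(runs):
            if condition == "file_deleted" and os.path.exists(save_path):
                os.remove(save_path)  # force the save to create a new file
            if not cycle(save_path):
                failures[condition] += 1
    return failures
```

Comparing the two counters over a larger number of runs gives a rough idea of whether the presence of the file merely shifts the timing or has no effect at all.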
>... but after some more testing I noticed another interesting thing. "xm
>save" hangs if the suspend file doesn't exist. For the first time after a
>dom0 reboot it's normally no problem, but if I delete the file and try a "xm
>save" again it will not work for 95%.
>
>If I keep the save-file and then make a "xm save" and a "xm restore" it seems
>to be no problem. I made 10 tests and all worked.

Fix attached below. It actually has nothing to do with whether the file exists or not.
Rather, the problem is that on occasion xfrd sends a response and a request in the same 'message', and Xend only deals with the first.

The patch below fixes this for me. Please let me know if it works for you.

cheers,

S.

diff -r 973a2d3c7a63 tools/python/xen/xend/XendMigrate.py
--- a/tools/python/xen/xend/XendMigrate.py Wed Aug 3 23:24:27 2005
+++ b/tools/python/xen/xend/XendMigrate.py Thu Aug 18 19:14:42 2005
@@ -54,7 +54,7 @@
 
     def dataReceived(self, data):
         self.parser.input(data)
-        if self.parser.ready():
+        while(self.parser.ready()):
             val = self.parser.get_val()
             self.xinfo.dispatch(self, val)
         if self.parser.at_eof():
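To illustrate why changing `if` to `while` matters, here is a minimal, self-contained sketch in Python 3. The LineParser class is a hypothetical stand-in for Xend's parser: it frames messages by newline rather than parsing s-expressions, and only mimics the input/ready/get_val method names visible in the patch. The strings "response" and "request" are placeholders, not real xfr protocol messages.

```python
class LineParser:
    """Hypothetical stand-in for the real parser: one message per line."""

    def __init__(self):
        self._buffer = ""
        self._values = []

    def input(self, data):
        # Accumulate data and queue every complete message.
        self._buffer += data
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            self._values.append(line)

    def ready(self):
        return bool(self._values)

    def get_val(self):
        return self._values.pop(0)


def receive_once(parser, data, dispatch):
    """Unpatched form: at most one message is handled per delivery."""
    parser.input(data)
    if parser.ready():
        dispatch(parser.get_val())


def receive_all(parser, data, dispatch):
    """Patched form: every complete message in the delivery is handled."""
    parser.input(data)
    while parser.ready():
        dispatch(parser.get_val())


# A response and a request arriving together in a single delivery:
coalesced = "response\nrequest\n"

seen_if, seen_while = [], []
receive_once(LineParser(), coalesced, seen_if.append)
receive_all(LineParser(), coalesced, seen_while.append)
print(seen_if)     # ['response']
print(seen_while)  # ['response', 'request']
```

In the `if` form the second message stays queued in the parser until more data arrives. If the peer is itself waiting for a reply to that message, no more data comes, which would be consistent with the hang reported in this thread.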
On Friday, 19 August 2005 04:14, Steven Hand wrote:
> Fix attached below - it's actually nothing to do with whether the file
> exists or not. Rather the problem is that on occasion xfrd sends a response
> and a request in the same 'message', and Xend only deals with the first.
>
> The below fixes this for me - please let me know if it works for you,

I can't test it right now, because the server is in production use. I have to schedule a maintenance window so that I can reboot the system (which is needed if the problem is not fixed and an "xm save" hangs again).

I tried to reproduce the bug on my notebook and on a normal desktop PC, but there I have no problems with "xm save" at all. The only difference between those systems and the production system is that the production system is an SMP system (2x Xeon CPUs) with hyperthreading enabled.

And there definitely was a difference depending on whether I deleted the file each time before running "xm save". I am not saying that the bug has something to do with the file itself, but maybe it just triggers the error (because creating a file takes longer than overwriting one?). Maybe that's why the problem occurs one time and not the next.

I will let you know once I have been able to test the patch on the production system (or another SMP/HT system), but that can take a few more days... sorry.

thx for your help,
--Ralph
>I can't test it right now, because the server is in production use now. I have
>to schedule a maintenance window to reboot the system (and that is needed if
>the problem is not fixed and a "xm save" crashes.

OK (although I'm confident the fix is a strict stability improvement: I stress-tested over 15,000 save/restore cycles at a variety of frequencies without a single problem).

But then again, it's your server :-)

The problem was a race condition, so timing (and concurrency at the hardware level) is likely to affect the probability of it occurring. For example, SMP versus uniprocessor, a slow versus a fast machine, or anything along those lines could increase the chance that you'd see it.

>I let you know if I could test the patch on the production system (or another
>smp/ht system), but that can take some more days... sorry.

No probs. The fix is in 2.0-testing, but that also includes a bunch of other stuff, so it's probably best to just apply that patch locally.

cheers,

S.
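The point about timing can be illustrated with a small simulation. This is not the stress test mentioned above, only a sketch in Python 3 under stated assumptions: newline framing stands in for the real protocol, the messages are dummy strings, and random_chunks, handled_messages and loss_count are hypothetical helpers. The same stream is replayed under many random splits; max_size models how much data can pile up into one delivery.

```python
import random

MESSAGES = ["m%d" % i for i in range(20)]
STREAM = "".join(m + "\n" for m in MESSAGES)  # newline framing is an assumption


def random_chunks(stream, rng, max_size):
    """Split the stream into consecutive pieces of 1..max_size characters."""
    pos = 0
    while pos < len(stream):
        size = rng.randint(1, max_size)
        yield stream[pos:pos + size]
        pos += size


def handled_messages(stream, rng, max_size, drain):
    """Replay one random split and return the messages that got handled."""
    buffer, pending, handled = "", [], []
    for chunk in random_chunks(stream, rng, max_size):
        buffer += chunk
        while "\n" in buffer:  # queue every complete message
            line, buffer = buffer.split("\n", 1)
            pending.append(line)
        if drain:
            while pending:  # patched form: handle everything queued
                handled.append(pending.pop(0))
        elif pending:  # unpatched form: handle at most one per delivery
            handled.append(pending.pop(0))
    return handled


def loss_count(max_size, drain, trials=500):
    """Number of trials in which at least one message was left unhandled."""
    rng = random.Random(2005)  # fixed seed so runs are repeatable
    return sum(
        handled_messages(STREAM, rng, max_size, drain) != MESSAGES
        for _ in range(trials)
    )


for max_size in (3, 16):
    print(max_size, loss_count(max_size, drain=False), loss_count(max_size, drain=True))
```

With deliveries of at most 3 characters, two messages can never land in the same delivery, so even the unpatched form loses nothing. With larger deliveries the unpatched form starts leaving messages behind, while the draining form handles every message regardless of how the stream is split.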
On Monday, 22 August 2005 21:45, Steven Hand wrote:
> No probs - the fix is in 2.0-testing but that also includes a bunch of
> other stuff, so probably best to just apply that patch locally.

Hi Steven,

I tried your patch last night after announcing a short downtime to our customers. After applying the patch and rebuilding Xen, the problems were gone. I ran 200 saves and restores on different domUs and had no problem at all.

It no longer matters whether the save file exists before the "xm save" command or not. I know that this was not the bug itself, but whether the file existed is what used to trigger the race condition on our system.

Thanks for your help and your work on Xen...

regards,
Ralph