Florian Kirstein
2007-Jan-17 22:36 UTC
[Xen-devel] Two problems with DomU reboot (cmdline, duplicate domains)
Hi,

I just upgraded my test system to 3.0.4 (using the provided source RPM, xen-3.0.4.1-1.src.rpm, rebuilt to run on non-PXE but with no other changes; it doesn't look like the 3.0.4-testing HG has major differences so far?) and I have two problems with rebooting domains (xm reboot, or reboot from inside the domain).

1) The kernel command line keeps growing. On the first boot it's OK; on the second it's there three times as one long cmdline:

ip=172.16.37.9:1.2.3.4:172.16.37.19:255.255.255.0::eth0:off root=/dev/sda1 ro ip=172.16.37.9:1.2.3.4:172.16.37.19:255.255.255.0::eth0:off root=/dev/sda1 ro ip=172.16.37.9:1.2.3.4:172.16.37.19:255.255.255.0::eth0:off root=/dev/sda1 ro

and on the next reboot it's longer than the kernel supports, which usually breaks networking, as the kernel seems to use the last "ip=" parameter, which is then most probably incomplete. It also happens when I don't specify networking there, using a config almost identical to xmexample1. After the first reboot:

# cat /proc/cmdline
root=/dev/sda1 ro root=/dev/sda1 ro root=/dev/sda1 ro 3

Interesting that the "3" from the "extra" parameter is not duplicated...

2) It has happened twice so far (not yet reproducible for me) that after a reboot I had the same DomU twice, using the same name and the same block devices (LVM-based phy: devices). This of course resulted in major data corruption and really doesn't make me feel good. I read there were changes in the code which should prevent this, but for me it seems to have gotten worse; I never had this before...

Thanks for any ideas on this. I will try the current 3.0.4-testing hg later, but as I currently can't reproduce "2)", I hope someone here knows whether this is fixed and what could have triggered it?

(:ul8er, r@y

P.S.: Oh, I could reproduce it a third time; I think I just issued another reboot on the domain, probably while a libvirt-based tool was listing domains in the background. Doing some tests there now...
However, this is how it looks:

DomU9                             72       50     1 -b----      8.8
DomU9                             73       50     1 -b----      8.1

and both are actually running. I can attach to both consoles using

/usr/lib/xen/bin/xenconsole 72
/usr/lib/xen/bin/xenconsole 73

and they are indeed different instances running on the same block devices. Ouch :)

P.P.S.: Here's an excerpt from the xend.log (from the first time it happened) which I suppose shows the part where the duplicate domain was created. Unfortunately the log level was set to INFO:

[2007-01-17 16:28:04 xend.XendDomainInfo 12761] INFO (XendDomainInfo:969) Domain has shutdown: name=DomU9 id=18 reason=reboot.
[2007-01-17 16:28:04 xend.XendDomainInfo 12761] INFO (XendDomainInfo:969) Domain has shutdown: name=DomU9 id=18 reason=reboot.
[2007-01-17 16:28:04 xend.XendDomainInfo 12761] INFO (XendDomainInfo:969) Domain has shutdown: name=DomU9 id=18 reason=reboot.
[2007-01-17 16:28:05 xend.XendDomainInfo 12761] ERROR (XendDomainInfo:1063) Xend failed during restart of domain None. Refusing to restart to avoid loops.
[2007-01-17 16:28:05 xend.XendConfig 12761] WARNING (XendConfig:606) Unconverted key: cpus
[2007-01-17 16:28:05 xend 12761] INFO (image:125) buildDomain os=linux dom=19 vcpus=1
[2007-01-17 16:28:05 xend.XendDomainInfo 12761] INFO (XendDomainInfo:1194) createDevice: vbd : {'uuid': '5f42ac79-37a5-3511-5586-d215a356aa2d', 'driver': 'paravirtualised', 'dev': 'sda1:disk', 'uname': 'phy:/dev/vgrc/h_root_110', 'mode': 'w', 'backend': 0}
[2007-01-17 16:28:05 xend.XendDomainInfo 12761] INFO (XendDomainInfo:1194) createDevice: vbd : {'uuid': 'a6f7ccdb-f331-211f-1721-22a8d7aa2097', 'driver': 'paravirtualised', 'dev': 'sda2:disk', 'uname': 'phy:/dev/vgrc/swap110', 'mode': 'w', 'backend': 0}
[2007-01-17 16:28:05 xend.XendDomainInfo 12761] INFO (XendDomainInfo:1194) createDevice: vif : {'ip': '172.16.37.9', 'mac': '00:16:3e:00:32:f1', 'script': 'vif-route', 'uuid': '28a916b8-0e7c-cd43-81b6-fd35bf48ecae', 'backend': 0}
[2007-01-17 16:28:06 xend.XendConfig 12761] WARNING (XendConfig:606) Unconverted key: cpus
[2007-01-17 16:28:06 xend 12761] INFO (image:125) buildDomain os=linux dom=20 vcpus=1
[2007-01-17 16:28:06 xend.XendDomainInfo 12761] INFO (XendDomainInfo:1194) createDevice: vbd : {'uname': 'phy:/dev/vgrc/h_root_110', 'driver': 'paravirtualised', 'mode': 'w', 'dev': 'sda1', 'uuid': '5f42ac79-37a5-3511-5586-d215a356aa2d'}
[2007-01-17 16:28:06 xend.XendDomainInfo 12761] INFO (XendDomainInfo:1194) createDevice: vbd : {'uname': 'phy:/dev/vgrc/swap110', 'driver': 'paravirtualised', 'mode': 'w', 'dev': 'sda2', 'uuid': 'a6f7ccdb-f331-211f-1721-22a8d7aa2097'}
[2007-01-17 16:28:06 xend.XendDomainInfo 12761] INFO (XendDomainInfo:1194) createDevice: vif : {'ip': '172.16.37.9', 'mac': '00:16:3e:00:32:f1', 'script': 'vif-route', 'uuid': '28a916b8-0e7c-cd43-81b6-fd35bf48ecae', 'backend': 0}

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
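A quick way to spot the growing command line from inside a DomU is to count repeated key=value parameters. The following is only a minimal sketch: the function name `duplicated_params` is made up for illustration, and it assumes parameters are whitespace-separated, which holds for the examples quoted in the message above.

```python
from collections import Counter


def duplicated_params(cmdline):
    """Return {key: count} for every key=value parameter seen more than once."""
    # Only tokens of the form key=value are counted; bare flags such as
    # "ro" or the runlevel "3" are ignored.
    keys = [tok.split("=", 1)[0] for tok in cmdline.split() if "=" in tok]
    return {key: n for key, n in Counter(keys).items() if n > 1}


# The /proc/cmdline content reported after the reboot:
cmdline = "root=/dev/sda1 ro root=/dev/sda1 ro root=/dev/sda1 ro 3"
print(duplicated_params(cmdline))  # {'root': 3}
```

On a running guest the string would come from reading /proc/cmdline instead of the hard-coded example.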
Florian Kirstein
2007-Jan-19 09:41 UTC
[Xen-devel] [PATCH] fix: growing kernel commandline
Hi,

replying to myself... Obviously my problem No. 1 (growing command line) is already known, but I didn't find much about it besides a small comment from Ewan Mellor yesterday here on the list, after finding the relevant source parts responsible.

So for the interested few, it's caused by John's patch...

# User john.levon@sun.com
# Date 1167936545 28800
# Node ID acda3f65d9797126035cc8cae65d8804415c6036

...and is already fixed in xen-3.0-unstable. But unfortunately the released 3.0.4.1 is broken, so I transferred the patch from 3.0-unstable and modified it a bit, so that the command line really remains in the same order as in previous Xen versions (in testing, ip= and root= are swapped). That is for whoever needs it; my scripts don't :).

Oh, and one more thing: what's the idea behind the ip=[^ ] regexp in the test? Unlike the root= parameter check, this only matches non-empty ip parameters, so if there's an empty ip= parameter, we add our ip parameter anyway. That won't change a thing in the "new order", as I think the kernel uses the last ip= parameter it finds, which is then still the empty one? However, I left this unchanged...
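The difference between the two checks can be seen in isolation with Python's `re` module. This is just a small sketch: the two regular expressions are the ones used in the XendConfig.py test, while the helper names `has_ip` and `has_root` are invented here for illustration.

```python
import re

IP_CHECK = re.compile(r"ip=[^ ]+")   # needs at least one non-space char after "ip="
ROOT_CHECK = re.compile(r"root=")    # matches even if nothing follows "root="


def has_ip(kernel_args):
    """True if the args contain a non-empty ip= parameter."""
    return IP_CHECK.search(kernel_args) is not None


def has_root(kernel_args):
    """True if the args contain any root= parameter, empty or not."""
    return ROOT_CHECK.search(kernel_args) is not None


print(has_ip("ip=172.16.37.9 root=/dev/sda1 ro"))  # True
print(has_ip("ip= root=/dev/sda1 ro"))             # False: empty ip= is not detected
print(has_root("root= ro"))                        # True: empty root= is detected
```

So with an empty ip= already on the command line, the ip check does not fire and the configured ip value is added as well.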
So: I'm not sure if users are supposed to post patches for the xen-3.0.4-testing.hg repository as well, but as most production environments are probably using that instead of -unstable, here's the patch for those who are annoyed by this problem and want to patch their local sources:

# HG changeset patch
# User ray@build.ray.net
# Node ID c25e4e8a9668fc25c0424c2936d2e4f94345ab89
# Parent  f98a6a9df1b4ea6022d05cdb2d189cb7645408d2
Fix kernel commandline generation to prevent duplication of ip= and root=
parameters on reboot, while preserving the parameter ordering known from
previous versions

Signed-off-by: Florian Kirstein <ray@ray.net>

diff -r f98a6a9df1b4 -r c25e4e8a9668 tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py	Mon Jan 8 12:54:41 2007 -0800
+++ b/tools/python/xen/xend/XendConfig.py	Fri Jan 19 10:12:20 2007 +0100
@@ -1104,19 +1104,15 @@ class XendConfig(dict):
         self['PV_kernel'] = sxp.child_value(image_sxp, 'kernel','')
         self['PV_ramdisk'] = sxp.child_value(image_sxp, 'ramdisk','')
-        kernel_args = ""
+        kernel_args = sxp.child_value(image_sxp, 'args', '')
 
         # attempt to extract extra arguments from SXP config
+        arg_root = sxp.child_value(image_sxp, 'root')
+        if arg_root and not re.search(r'root=', kernel_args):
+            kernel_args = 'root=%s ' % arg_root + kernel_args
         arg_ip = sxp.child_value(image_sxp, 'ip')
         if arg_ip and not re.search(r'ip=[^ ]+', kernel_args):
-            kernel_args += 'ip=%s ' % arg_ip
-        arg_root = sxp.child_value(image_sxp, 'root')
-        if arg_root and not re.search(r'root=', kernel_args):
-            kernel_args += 'root=%s ' % arg_root
-
-        # user-specified args must come last: previous releases did this and
-        # some domU kernels rely upon the ordering.
-        kernel_args += sxp.child_value(image_sxp, 'args', '')
+            kernel_args = 'ip=%s ' % arg_ip + kernel_args
 
         self['PV_args'] = kernel_args

Now I'm still stuck with my other problem (duplicate DomUs being created and shredding the filesystem). I will do tests to reproduce that later today...

(:ul8er, r@y
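To see why the removed logic grows the command line while the patched logic does not, both variants can be modelled outside of xend. What follows is a simplified sketch and not the actual XendConfig code: a plain dict stands in for the image SXP, and `simulate` assumes that on each pass the stored 'args' value already holds the previously generated command line while 'root' is still present as a separate value. The names `build_args_old`, `build_args_fixed` and `simulate` are illustrative only.

```python
import re


def build_args_old(image):
    """Logic removed by the patch: start empty, append the stored args last."""
    kernel_args = ""
    arg_ip = image.get("ip")
    if arg_ip and not re.search(r"ip=[^ ]+", kernel_args):
        kernel_args += "ip=%s " % arg_ip
    arg_root = image.get("root")
    if arg_root and not re.search(r"root=", kernel_args):
        kernel_args += "root=%s " % arg_root
    # The checks above ran against an (almost) empty string, so parameters
    # already contained in the stored args are not noticed.
    kernel_args += image.get("args", "")
    return kernel_args


def build_args_fixed(image):
    """Patched logic: start from the stored args, prepend only what is missing."""
    kernel_args = image.get("args", "")
    arg_root = image.get("root")
    if arg_root and not re.search(r"root=", kernel_args):
        kernel_args = "root=%s " % arg_root + kernel_args
    arg_ip = image.get("ip")
    if arg_ip and not re.search(r"ip=[^ ]+", kernel_args):
        kernel_args = "ip=%s " % arg_ip + kernel_args
    return kernel_args


def simulate(builder, passes):
    """Feed the generated command line back in as 'args' (assumed model)."""
    image = {"root": "/dev/sda1 ro", "args": "3"}
    for _ in range(passes):
        image["args"] = builder(image)
    return image["args"]


print(simulate(build_args_old, 3))    # root=/dev/sda1 ro root=/dev/sda1 ro root=/dev/sda1 ro 3
print(simulate(build_args_fixed, 3))  # root=/dev/sda1 ro 3
```

Under this model, three passes through the old logic reproduce the /proc/cmdline reported earlier in the thread, including the single trailing "3", whereas the patched logic stays stable no matter how often it runs.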
Ian Campbell
2007-Jan-19 10:29 UTC
Re: [Xen-devel] [PATCH] fix: growing kernel commandline
On Fri, 2007-01-19 at 10:41 +0100, Florian Kirstein wrote:
> Hi,
>
> replying to myself... Obviously my problem No.1 (growing commandline)
> is already known, but I didn't find much besides a small comment from
> Ewan Mellor yesterday here on the list about it, after finding the
> relevant source parts responsible.
>
> So for the interested few, it's caused by John's Patch...
> # User john.levon@sun.com
> # Date 1167936545 28800
> # Node ID acda3f65d9797126035cc8cae65d8804415c6036
> ...and is already fixed in xen-3.0-unstable. But unfortunately the released
> 3.0.4.1 is broken, so I transferred the patch from 3.0-unstable and modified
> it a bit, so the commandline really remains in the same order as in
> previous xen versions (in testing ip= and root= are swapped). Whoever
> needs that, my scripts don't :).

Thanks for doing this.

I'm actually just about to push a straight backport of the fix which went into unstable. If further fixes are required on top of that we should consider making them in xen-unstable first.

Ian.
Florian Kirstein
2007-Jan-19 11:59 UTC
Re: [Xen-devel] [PATCH] fix: growing kernel commandline
Hi,

> I'm actually just about to push a straight backport of the fix which
> went into unstable. If further fixes are required on top of that we
> should consider making them in xen-unstable first.

I agree for anything beyond what I've done. But switching the processing of the ip= and root= parameters, which is basically just an exchange of the two blocks:

        arg_ip = sxp.child_value(image_sxp, 'ip')
        if arg_ip and not re.search(r'ip=[^ ]+', kernel_args):
            kernel_args = 'ip=%s ' % arg_ip + kernel_args

and:

        arg_root = sxp.child_value(image_sxp, 'root')
        if arg_root and not re.search(r'root=', kernel_args):
            kernel_args = 'root=%s ' % arg_root + kernel_args

seems to be risk-free. If you do this in addition to the -unstable patch we should be on the safe side, I hope :) Otherwise the ip= and root= parameters will change their order on the command line (because they now prepend instead of append themselves), and I thought keeping that intact for bad parsers was the main reason for all of this... That's why I made this change when backporting the fix.

(:ul8er, r@y
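The effect of the block order is easy to check in a small model. This is only a sketch under assumptions: a dict replaces the image SXP, and `build_args` with its `order` argument is a hypothetical helper that does not exist in xend. Because each block prepends, the block that runs last ends up first on the command line.

```python
import re

# Regexes from the XendConfig.py checks, keyed by parameter name.
CHECKS = {"ip": r"ip=[^ ]+", "root": r"root="}


def build_args(image, order):
    """Prepend each missing parameter, processing the keys in the given order."""
    kernel_args = image.get("args", "")
    for key in order:
        value = image.get(key)
        if value and not re.search(CHECKS[key], kernel_args):
            kernel_args = "%s=%s " % (key, value) + kernel_args
    return kernel_args


image = {
    "ip": "172.16.37.9:1.2.3.4:172.16.37.19:255.255.255.0::eth0:off",
    "root": "/dev/sda1 ro",
    "args": "3",
}

# root block first, ip block second: "ip=... root=... 3" (the old ordering)
print(build_args(image, ("root", "ip")))
# ip block first, root block second: "root=... ip=... 3" (swapped)
print(build_args(image, ("ip", "root")))
```

Processing root before ip therefore keeps ip= in front of root=, as in the command line shown in the first message of the thread.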
Florian Kirstein
2007-Jan-21 08:59 UTC
[Xen-devel] Bug: Problematic DomU Duplication on reboot
Hi,

OK, I did some more experiments and can now reproduce the duplication of a domain on its reboot. It seems to be a race condition somewhere, as I can trigger it by putting high load on xend. The really bad thing: all instances of the domain are then actively running on the same block devices, which almost certainly causes massive data corruption :-( And it can also happen in normal operation: I had it at least twice in a "normal" environment without much load on xend, possibly just a libvirt request at the wrong time during a DomU reboot.

If this is already known: sorry for the long mail then... Is there a fix for 3.0.4-testing? :)

If not, I more or less see two bugs there:

1) Why is the domain duplicated during the reboot?
2) Why is it possible at all that it's started twice, using the same devices? Could a check be added to prevent duplicate read-write use of the same device, or is there already one which is failing in this case?

Reproduction:

I was able to reproduce this quite reliably using the sample program dump-info.pl from the perl-Sys-virt libvirt interface. I (as root) just do a

while true; do ./dump-info.pl; done

in the examples dir to stress the system/xend. Building the loop inside dump-info.pl and removing all "print"s makes it work even a bit "better" and really messes things up, so try that if the other doesn't work. I tested it on a P4 3 GHz and a dual-core A64 2.2 GHz; it's easier when I use nosmp on the Xen kernel on the A64, but it also works in the SMP case.

While this is running I simply issue:

xm reboot DomU1

and most of the time it results in two or more DomU1s running afterwards... Sometimes it also causes DomU1 to disappear, with an entry in the log saying it was rebooting too fast (of course I waited long enough before the reboot). If it "works" it looks like this afterwards:

DomU1                             97      256     1 -b----     12.5
DomU1                             98      256     1 -b----     12.9

DomU1 is just a normal paravirtualized Linux guest. Dom0 is a CentOS 4, in case it matters.
Observations:

During the reboot, sometimes multiple duplicates were created, the load on Dom0 went up to about 30, and I saw lots of xen-backend hotplug agents:

10613 ?        S<     0:00  \_ /bin/sh /sbin/hotplug xen-backend
10617 ?        S<     0:01  |   \_ /bin/sh /etc/hotplug/xen-backend.agent
15018 ?        S<     0:00  \_ /bin/sh /sbin/hotplug xen-backend
15248 ?        S<     0:01  |   \_ /bin/sh /etc/hotplug/xen-backend.agent
14698 ?        S<     0:00  \_ /bin/sh /sbin/hotplug xen-backend
14702 ?        S<     0:00  |   \_ /bin/sh /etc/hotplug/xen-backend.agent
15091 ?        S<     0:00  \_ /bin/sh /sbin/hotplug xen-backend

(about 60 more lines like this, and I had just one DomU). After everything settled, the result was:

VM100                             38      256     1 -b----     13.3
VM100                             10      256     1 -b----     14.1

Note the large gap between IDs 10 and 38, meaning 27 domains were partially created and then died. The domain I rebooted had ID 9.

Oh, and one more thing: when using "stress" to put load on the Dom0 system instead of the perl-Sys-virt tool, it usually causes the DomU to disappear on reboot, but I couldn't reproduce the duplication that way.

All this was done with the released 3.0.4.1-1. I will try xen-unstable next, but possibly someone already has an idea what could be wrong here?

(:ul8er, r@y
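Until the race itself is fixed, a duplicated domain can at least be detected from the domain listing. The sketch below only parses text in the column layout quoted in this thread (name, ID, memory, VCPUs, state, time); it does not call xm itself. The function name `duplicate_domains` is made up for illustration, and it assumes domain names contain no whitespace.

```python
from collections import defaultdict


def duplicate_domains(listing):
    """Return {name: [ids]} for every domain name that appears more than once."""
    ids_by_name = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, whose second column is not a number.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        ids_by_name[fields[0]].append(int(fields[1]))
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


listing = """\
Name                              ID Mem(MiB) VCPUs State  Time(s)
Domain-0                           0      512     2 r-----    100.0
DomU1                             97      256     1 -b----     12.5
DomU1                             98      256     1 -b----     12.9
"""
print(duplicate_domains(listing))  # {'DomU1': [97, 98]}
```

A non-empty result means two instances share one name and, as in the cases described above, very likely the same block devices, so at least one of them should be stopped before more data is written.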