Ian Jackson
2011-Feb-04 18:08 UTC
[Xen-devel] Probable Xen bug triggered by localhost migration
Once again I have had a test fail during "10 migrations of a PV domain to localhost", with an apparent Xen or dom0 lockup or other serious problem.

Failure modes include:

 * dom0 reporting soft lockup BUGs (showing xl stuck in a privcmd ioctl, apparently in a hypercall)
 * dom0 disk controller failure due to apparent lost/stuck interrupt (dom0 decides disk not working, tries unsuccessfully to reset)
 * apparent dom0 lockup or networking failure

Problems occur with both XCP 2.6.27 and pvops 2.6.32 kernels.

Problems seem only to happen with xl, but that's likely to be because it's due to a race; xl and xend will make various calls in different orders and with different timing.

Having added some machinery to request Xen debug keys, I now have some more information:

  http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/info.html

The most relevant files there are these:

  http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/14.ts-guest-localmigrate.log

That shows the failure. The test harness ssh's to the dom0 to run "xl migrate" and gets "No route to host", which typically means it has stopped responding to ARP requests. In this particular case the failure happened after an apparently-successful previous migration, but the more common failure mode is that "xl migrate" prints the 0% progress message and then nothing else gets through.

  http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/serial-woodlouse.log

Serial log. Scroll to around "Feb 4 03:30:35" (timestamps, and the messages about clients connecting and disconnecting, are from the serial concentrator).

You'll see a series of debug key outputs, which you can correlate with the test harness's requests, listed with timestamps here:

  http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/15.ts-logs-capture.log

After the Xen debug keys have been run through, the test harness sends the "q" guest debug key, which also produces the output you can see in the serial log.

Then the test harness switches the serial back to dom0 and sends RET, and we can see dom0 produce a new login prompt. So dom0 is not entirely dead. However, later entries in the "ts-logs-capture" log show that it still isn't responding to the network, and eventually the test harness decides to power cycle the host and collect what remains from the dom0 filesystem.

So that's why you see a pile of boot messages at the end of the test log - these should be disregarded.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
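The distinction drawn above between "No route to host" and other connection failures can be checked mechanically. What follows is only a rough sketch, not the test harness's actual code: the function names, the choice of TCP port 22, and the category labels are all assumptions made for illustration. It relies on the fact that, for a host on the same network segment, a Linux client usually reports EHOSTUNREACH when link-layer address resolution fails, whereas a host that is up but has no listening sshd returns ECONNREFUSED.

```python
import errno
import socket


def classify_connect_failure(exc):
    """Map a connection error to a coarse category (labels are illustrative)."""
    if isinstance(exc, socket.timeout):
        return "timeout"
    code = getattr(exc, "errno", None)
    if code == errno.EHOSTUNREACH:
        # Typically: the target did not answer ARP on the local segment.
        return "no-route-to-host"
    if code == errno.ECONNREFUSED:
        # The network stack answered, but nothing is listening on the port.
        return "refused"
    return "other"


def probe(host, port=22, timeout=5.0):
    """Try a TCP connection and report 'ok' or a failure category."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return classify_connect_failure(exc)
```

A result of "no-route-to-host" points at the dom0 network path as a whole rather than at sshd, which matches the symptom described in the message.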
Tim Deegan
2011-Feb-07 11:10 UTC
Re: [Xen-devel] Probable Xen bug triggered by localhost migration
At 18:08 +0000 on 04 Feb (1296842938), Ian Jackson wrote:
> http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/serial-woodlouse.log
>
> Serial log. Scroll to around "Feb 4 03:30:35" (timestamps, and the
> messages about clients connecting and disconnecting, are from the
> serial concentrator).

Whatever happened here, we missed it. By the time the debug keys are sent the damage is done and the whole system is idle. I don't see anything particularly worrying in the rest of the system - in particular the time skew looks quite sane.

Were some of the other failures on non-credit2 tests?

Can you try adding "watchdog" to the Xen command-line?

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)
Ian Jackson
2011-Feb-07 12:18 UTC
Re: [Xen-devel] Probable Xen bug triggered by localhost migration
Tim Deegan writes ("Re: [Xen-devel] Probable Xen bug triggered by localhost migration"):
> Whatever happened here, we missed it. By the time the debug keys are
> sent the damage is done and the whole system is idle. I don't see
> anything particularly worrying in the rest of the system - in particular
> the time skew looks quite sane.

Hrm.

> Were some of the other failures on non-credit2 tests?

Yes.

> Can you try adding "watchdog" to the Xen command-line?

I will arrange for it to do so.

Ian.
Ian Jackson
2011-Feb-09 11:55 UTC
Re: [Xen-devel] Probable Xen bug triggered by localhost migration
Tim Deegan writes ("Re: [Xen-devel] Probable Xen bug triggered by localhost migration"):
> Whatever happened here, we missed it. By the time the debug keys are
> sent the damage is done and the whole system is idle. I don't see
> anything particularly worrying in the rest of the system - in particular
> the time skew looks quite sane.

Here's another one. In this case it looks like the disk controller interrupt has gone awol:

  http://www.chiark.greenend.org.uk/~xensrcts/logs/5673/test-i386-i386-xl/info.html
  http://www.chiark.greenend.org.uk/~xensrcts/logs/5673/test-i386-i386-xl/serial-woodlouse.log

The serial log shows flailings from dom0 interspersed with Xen debug output. The debug output seems a bit partial TBH; I think the test harness may still be struggling with bugs in the serial port access tool I'm using.

> Can you try adding "watchdog" to the Xen command-line?

This is in the pipeline.

Ian.
Tim Deegan
2011-Feb-09 12:10 UTC
Re: [Xen-devel] Probable Xen bug triggered by localhost migration
At 11:55 +0000 on 09 Feb (1297252516), Ian Jackson wrote:
> Here's another one. In this case it looks like the disk controller
> interrupt has gone awol:
>
> http://www.chiark.greenend.org.uk/~xensrcts/logs/5673/test-i386-i386-xl/info.html
> http://www.chiark.greenend.org.uk/~xensrcts/logs/5673/test-i386-i386-xl/serial-woodlouse.log

Ah, this one has a much clearer problem - when I made the log-dirty bitmap allocate pages from the p2m pool I introduced a path where the log-dirty lock and the shadow lock can be taken in the wrong order. :(

I'll fix that now, but since it didn't show up in the other trace there's at least one more bug out there somewhere.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)
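For readers unfamiliar with this class of bug, here is a small, self-contained illustration of a lock-order check. It is a sketch only and is not Xen code: the OrderedLock class, the numeric ranks, and the helper function are hypothetical, and the ordering chosen here (log-dirty before shadow) is an arbitrary assumption for the example, since the message above does not say which order is the correct one. The idea is that every lock carries a rank, and taking a lock whose rank is not higher than one already held is reported immediately, instead of surfacing later as a deadlock when two CPUs take the same pair in opposite orders.

```python
import threading


class OrderedLock:
    """A lock with a fixed rank; locks must be taken in increasing rank."""

    _held = threading.local()  # per-thread list of locks currently held

    def __init__(self, name, rank):
        self.name = name
        self.rank = rank
        self._lock = threading.Lock()

    def __enter__(self):
        stack = getattr(self._held, "stack", None)
        if stack is None:
            stack = self._held.stack = []
        for held in stack:
            if held.rank >= self.rank:
                raise RuntimeError(
                    f"lock order violation: taking {self.name} "
                    f"while holding {held.name}"
                )
        self._lock.acquire()
        stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._held.stack.remove(self)
        self._lock.release()
        return False


# Hypothetical ranks, chosen only for this example.
log_dirty = OrderedLock("log_dirty", rank=1)
shadow = OrderedLock("shadow", rank=2)


def nesting_allowed(first, second):
    """Return True if 'second' may be taken while holding 'first'."""
    try:
        with first:
            with second:
                return True
    except RuntimeError:
        return False
```

With these assumed ranks, nesting_allowed(log_dirty, shadow) succeeds, while nesting_allowed(shadow, log_dirty) is rejected; a newly added code path that takes the pair the other way round would trip the check the first time it ran.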