Ian Jackson
2011-Feb-04 18:08 UTC
[Xen-devel] Probable Xen bug triggered by localhost migration
Once again I have had a test fail during "10 migrations of a PV domain
to localhost", with an apparent Xen or dom0 lockup or other serious
problem.
Failure modes include:
* dom0 reporting soft lockup BUGs (showing xl stuck in a privcmd
ioctl, apparently in a hypercall)
* dom0 disk controller failure due to apparent lost/stuck
interrupt (dom0 decides disk not working, tries unsuccessfully to
reset)
* apparent dom0 lockup or networking failure
Problems occur with both XCP 2.6.27 and pvops 2.6.32 kernels.
Problems seem to happen only with xl, but that is probably because the
bug is a race: xl and xend make their calls in different orders and
with different timing.
Having added some machinery to request Xen debug keys, I now have some
more information:
http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/info.html
The most relevant files there are these:
http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/14.ts-guest-localmigrate.log
That shows the failure. The test harness ssh's to the dom0 to run
"xl migrate" and gets "No route to host", which typically means the
dom0 has stopped responding to ARP requests. In this particular case
the failure happened after an apparently-successful previous migration,
but the more common failure mode is that "xl migrate" prints the 0%
progress message and then nothing else gets through.
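A test of this shape can be approximated with a loop like the one below. This is a hypothetical sketch, not the actual test harness: the function names, the `runner` parameter and the timeout value are assumptions made for illustration. Only the idea of ssh'ing to the dom0 and running "xl migrate" of the guest to localhost comes from the report above.

```python
import subprocess


def migrate_cmd(dom0_host, domain):
    """Build the ssh command that asks dom0 to migrate `domain` to localhost."""
    return ["ssh", dom0_host, "xl", "migrate", domain, "localhost"]


def run_migrations(dom0_host, domain, count=10, runner=subprocess.run, timeout=600):
    """Run `count` localhost migrations in a row.

    Returns None if every migration succeeded, otherwise a tuple
    (iteration_number, reason) for the first failure.
    `runner` is injectable so the loop can be exercised without a real host.
    """
    for i in range(1, count + 1):
        try:
            result = runner(
                migrate_cmd(dom0_host, domain),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Corresponds to the "0% progress message and then nothing" case.
            return i, "timeout"
        if result.returncode != 0:
            # For example ssh reporting "No route to host".
            return i, result.stderr.strip()
    return None
```

Recording which iteration failed matters here because the failure is intermittent: it can follow one or more apparently-successful migrations.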
http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/serial-woodlouse.log
Serial log. Scroll to around "Feb 4 03:30:35" (timestamps, and the
messages about clients connecting and disconnecting, are from the
serial concentrator).
You'll see a series of debug key outputs, which you can correlate with
the test harness's requests, listed with timestamps here:
http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/15.ts-logs-capture.log
After the Xen debug keys have been run through, the test harness sends
the "q" guest debug key, which also produces the output you can see in
the serial log.
Then the test harness switches the serial back to dom0 and sends RET
and we can see dom0 produce a new login prompt. So dom0 is not
entirely dead.
However, later entries in the "ts-logs-capture" log show that it still
isn't responding to the network, and eventually the test harness
decides to power cycle the host and collect what remains from the dom0
filesystem. So that's why you see a pile of boot messages at the end
of the test log - these should be disregarded.
Ian.
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Tim Deegan
2011-Feb-07 11:10 UTC
Re: [Xen-devel] Probable Xen bug triggered by localhost migration
At 18:08 +0000 on 04 Feb (1296842938), Ian Jackson wrote:
> http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/serial-woodlouse.log
>
> Serial log. Scroll to around "Feb 4 03:30:35" (timestamps, and the
> messages about clients connecting and disconnecting, are from the
> serial concentrator).
Whatever happened here, we missed it. By the time the debug keys are
sent the damage is done and the whole system is idle. I don't see
anything particularly worrying in the rest of the system - in particular
the time skew looks quite sane.
Were some of the other failures on non-credit2 tests?
Can you try adding "watchdog" to the Xen command-line?
Tim.
--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)
Ian Jackson
2011-Feb-07 12:18 UTC
Re: [Xen-devel] Probable Xen bug triggered by localhost migration
Tim Deegan writes ("Re: [Xen-devel] Probable Xen bug triggered by localhost migration"):
> Whatever happened here, we missed it. By the time the debug keys are
> sent the damage is done and the whole system is idle. I don't see
> anything particularly worrying in the rest of the system - in particular
> the time skew looks quite sane.
Hrm.
> Were some of the other failures on non-credit2 tests?
Yes.
> Can you try adding "watchdog" to the Xen command-line?
I will arrange for it to do so.
Ian.
Ian Jackson
2011-Feb-09 11:55 UTC
Re: [Xen-devel] Probable Xen bug triggered by localhost migration
Tim Deegan writes ("Re: [Xen-devel] Probable Xen bug triggered by localhost migration"):
> Whatever happened here, we missed it. By the time the debug keys are
> sent the damage is done and the whole system is idle. I don't see
> anything particularly worrying in the rest of the system - in particular
> the time skew looks quite sane.
Here's another one. In this case it looks like the disk controller
interrupt has gone awol:
http://www.chiark.greenend.org.uk/~xensrcts/logs/5673/test-i386-i386-xl/info.html
http://www.chiark.greenend.org.uk/~xensrcts/logs/5673/test-i386-i386-xl/serial-woodlouse.log
The serial log shows flailings from dom0 interspersed with Xen debug
output.
The debug output seems a bit partial TBH; I think the test harness may
still be struggling with bugs in the serial port access tool I'm
using.
> Can you try adding "watchdog" to the Xen command-line?
This is in the pipeline.
Ian.
Tim Deegan
2011-Feb-09 12:10 UTC
Re: [Xen-devel] Probable Xen bug triggered by localhost migration
At 11:55 +0000 on 09 Feb (1297252516), Ian Jackson wrote:
> Here's another one. In this case it looks like the disk controller
> interrupt has gone awol:
>
> http://www.chiark.greenend.org.uk/~xensrcts/logs/5673/test-i386-i386-xl/info.html
> http://www.chiark.greenend.org.uk/~xensrcts/logs/5673/test-i386-i386-xl/serial-woodlouse.log
Ah, this one has a much clearer problem - when I made the log-dirty
bitmap allocate pages from the p2m pool I introduced a path where the
log-dirty lock and the shadow lock can be taken in the wrong order. :(
I'll fix that now, but since it didn't show up in the other trace
there's at least one more bug out there somewhere.
Cheers,
Tim.
--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)
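For readers unfamiliar with this class of bug: when two locks are taken in one order on one code path and in the opposite order on another, two CPUs can each hold one lock and wait forever for the other. The Python sketch below illustrates the general idea of enforcing a fixed lock order. It is not Xen code; the `OrderedLock` class, the rank numbers, and the choice of which lock counts as the outer one are all assumptions made purely for illustration.

```python
import threading


class OrderedLock:
    """A lock with a fixed rank. A thread may only take a lock whose rank is
    strictly higher than that of the innermost lock it already holds."""

    _held = threading.local()  # per-thread stack of currently held locks

    def __init__(self, name, rank):
        self.name = name
        self.rank = rank
        self._lock = threading.Lock()

    def __enter__(self):
        stack = getattr(OrderedLock._held, "stack", None)
        if stack is None:
            stack = OrderedLock._held.stack = []
        if stack and stack[-1].rank >= self.rank:
            # Fail loudly instead of risking a silent deadlock.
            raise RuntimeError(
                f"lock order violation: {self.name} taken while holding "
                f"{stack[-1].name}"
            )
        self._lock.acquire()
        stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        OrderedLock._held.stack.pop()
        self._lock.release()
        return False


# Hypothetical ranks: which lock is "outer" here is an arbitrary choice.
log_dirty_lock = OrderedLock("log_dirty", 1)
shadow_lock = OrderedLock("shadow", 2)


def consistent_path():
    # Takes the locks in the agreed order.
    with log_dirty_lock:
        with shadow_lock:
            return "ok"


def inverted_path():
    # Takes the same two locks in the opposite order; the check rejects it.
    with shadow_lock:
        with log_dirty_lock:
            return "unreachable"
```

Calling `inverted_path()` raises `RuntimeError` on every run, whereas a real inversion only deadlocks when the timing is unlucky. That is consistent with the failures in this thread appearing intermittently.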