When xm failed to do a live migration, the system was resumed on the sending host. xl does not do that: it tries to, but just crashes the guest:

kernel BUG at drivers/xen/events.c:1466!

(In this example the target host didn't have the logical volume activated.)

Now that can't be right. IMHO xl should do some checking of whether the target is a viable migration target (are the disks and vifs there, is there enough memory available?) and preserve a safe state on the sender until the guest has properly started on the receiver.
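A pre-flight check of the kind suggested here could, in principle, run on the sender before any memory is transferred. The C sketch below only illustrates the idea: the remote_path_exists() and remote_free_memkb() helpers and the guest_cfg structure are hypothetical placeholders, not existing libxl or libxc interfaces.

    /* Sketch only: a pre-flight viability check the sender could run against
     * the migration target before starting the transfer.  The helpers below
     * are hypothetical placeholders, not existing toolstack functions. */
    #include <stdio.h>
    #include <stdbool.h>

    struct guest_cfg {
        const char **disk_paths;     /* e.g. an LVM path like the ones in this thread */
        int num_disks;
        unsigned long memory_kb;     /* guest memory requirement */
    };

    /* Hypothetical: ask the receiving host whether a block device exists. */
    bool remote_path_exists(const char *host, const char *path);
    /* Hypothetical: ask the receiving host how much free memory it has. */
    unsigned long remote_free_memkb(const char *host);

    static int migration_preflight(const char *target, const struct guest_cfg *cfg)
    {
        /* Refuse to start the migration if any backing disk is missing. */
        for (int i = 0; i < cfg->num_disks; i++) {
            if (!remote_path_exists(target, cfg->disk_paths[i])) {
                fprintf(stderr, "preflight: %s missing on %s, not migrating\n",
                        cfg->disk_paths[i], target);
                return -1;
            }
        }
        /* Refuse if the target cannot hold the guest's memory. */
        if (remote_free_memkb(target) < cfg->memory_kb) {
            fprintf(stderr, "preflight: not enough free memory on %s\n", target);
            return -1;
        }
        return 0;   /* looks viable; proceed with the live migration */
    }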
On Tue, 2011-09-20 at 10:42 +0100, Andreas Olsowski wrote:
> When xm failed to do a live migration, the system was resumed on the
> sending host.
> xl does not do that: it tries to, but just crashes the guest:
> kernel BUG at drivers/xen/events.c:1466!
> (In this example the target host didn't have the logical volume activated.)
>
> Now that can't be right.

No, it's a bug, perhaps in the kernel but likely in the toolstack.

Please can you provide full logs from /var/log/xen on both ends. Running "xl -vvv migrate" will also produce more stuff on stdout, some of which may be useful.

Also please capture the complete guest log in case it is an issue there.

> IMHO xl should do some checking of whether the target is a viable migration
> target (are the disks and vifs there, is there enough memory available?)
> and preserve a safe state on the sender until the guest has properly
> started on the receiver.

xl does have some checks and does preserve the sender side until it gets confirmation of correct restart, but obviously something is wrong with the bit which restarts the old guest. I'm sure it worked at one point though.

Thanks,
Ian.
Andreas Olsowski
2011-Sep-23 07:39 UTC
Re: [Xen-devel] pv guests die after failed migration
Here is the full procedure:

Preparations:

root@xenturio1:/var/log/xen# dmsetup ls |grep thiswillfail
xen--data-thiswillfail--swap (252, 236)
xen--data-thiswillfail--root (252, 235)

root@xenturio2:/var/log/xen# dmsetup ls |grep thiswillfail

>Server 2 does not have the logical volumes activated.

root@xenturio1:/usr/src/linux-2.6-xen# xl create /mnt/vmctrl/xenconfig/thiswillfail.sxp
Parsing config file /mnt/vmctrl/xenconfig/thiswillfail.sxp
Daemon running with PID 6722

>it is in fact running with pid 6723:

root@xenturio1:/usr/src/linux-2.6-xen# ps auxww |grep "xl create"
root 6723 0.0 0.0 35616 972 ? Ssl 09:14 0:00 xl create /mnt/vmctrl/xenconfig/thiswillfail.sxp

>Let's check the logfiles

root@xenturio1:/var/log/xen# cat xen-hotplug.log
RTNETLINK answers: Operation not supported
RTNETLINK answers: Operation not supported

>stupid netlink again, no matter what stuff I load into the kernel that
>still pops up ... annoying ... anyway, it's a non-issue in this case

root@xenturio1:/var/log/xen# cat xl-thiswillfail.log
Waiting for domain thiswillfail (domid 5) to die [pid 6723]

>Let's not make it wait any longer ;)

root@xenturio1:/usr/src/linux-2.6-xen# xl -vvv migrate thiswillfail xenturio2
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/380)
Loading new save file incoming migration stream (new xl fmt info 0x0/0x0/380)
Savefile contains xl domain config
xc: detail: Had 0 unexplained entries in p2m table
xc: Saving memory: iter 0 (last sent 0 skipped 0): 133120/133120 100%
xc: detail: delta 9499ms, dom0 88%, target 2%, sent 451Mb/s, dirtied 1Mb/s 324 pages
xc: Saving memory: iter 1 (last sent 130760 skipped 312): 133120/133120 100%
xc: detail: delta 23ms, dom0 91%, target 0%, sent 455Mb/s, dirtied 48Mb/s 34 pages
xc: Saving memory: iter 2 (last sent 320 skipped 4): 133120/133120 100%
xc: detail: Start last iteration
libxl: debug: libxl_dom.c:384:libxl__domain_suspend_common_callback issuing PV suspend request via XenBus control node
libxl: debug: libxl_dom.c:389:libxl__domain_suspend_common_callback wait for the guest to acknowledge suspend request
libxl: debug: libxl_dom.c:434:libxl__domain_suspend_common_callback guest acknowledged suspend request
libxl: debug: libxl_dom.c:438:libxl__domain_suspend_common_callback wait for the guest to suspend
libxl: debug: libxl_dom.c:450:libxl__domain_suspend_common_callback guest has suspended
xc: detail: SUSPEND shinfo 0007fafc
xc: detail: delta 206ms, dom0 2%, target 0%, sent 4Mb/s, dirtied 24Mb/s 154 pages
xc: Saving memory: iter 3 (last sent 30 skipped 4): 133120/133120 100%
xc: detail: delta 3ms, dom0 0%, target 0%, sent 1682Mb/s, dirtied 1682Mb/s 154 pages
xc: detail: Total pages sent= 131264 (0.99x)
xc: detail: (of which 0 were fixups)
xc: detail: All memory is saved
xc: detail: Save exit rc=0
libxl: error: libxl.c:900:validate_virtual_disk failed to stat /dev/xen-data/thiswillfail-root: No such file or directory
cannot add disk 0 to domain: -6
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:408:libxl_read_exactly file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:72:libxl_report_child_exitstatus migration target process [6837] exited with error status 3
Migration failed, resuming at sender.

>Now see if it really is resumed at sender:

root@xenturio1:/usr/src/linux-2.6-xen# xl console thiswillfail
PM: freeze of devices complete after 0.207 msecs
PM: late freeze of devices complete after 0.058 msecs
------------[ cut here ]------------
kernel BUG at drivers/xen/events.c:1466!
invalid opcode: 0000 [#1] SMP
CPU 0
Modules linked in:

Pid: 6, comm: migration/0 Not tainted 3.0.4-xenU #6
RIP: e030:[<ffffffff8140d574>] [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
RSP: e02b:ffff88001f9fbce0 EFLAGS: 00010082
RAX: ffffffffffffffef RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88001f809ea8 RSI: ffff88001f9fbd00 RDI: 0000000000000001
RBP: 0000000000000010 R08: ffffffff81859a00 R09: 0000000000000000
R10: 0000000000000000 R11: 09f911029d74e35b R12: 0000000000000000
R13: 000000000000f0a0 R14: 0000000000000000 R15: ffff88001f9fbd00
FS: 00007ff28f8c8700(0000) GS:ffff88001fec6000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff02056048 CR3: 000000001e4d8000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/0 (pid: 6, threadinfo ffff88001f9fa000, task ffff88001f9f7170)
Stack:
 ffff88001f9fbd34 ffff88001f9fbd54 0000000000000003 000000000000f100
 0000000000000000 0000000000000003 0000000000000000 0000000000000003
 ffff88001fa6ddb0 ffffffff8140aa20 ffffffff81859a08 0000000000000000
Call Trace:
 [<ffffffff8140aa20>] ? gnttab_map+0x100/0x130
 [<ffffffff815c2765>] ? _raw_spin_lock+0x5/0x10
 [<ffffffff81083e01>] ? cpu_stopper_thread+0x101/0x190
 [<ffffffff8140e1f5>] ? xen_suspend+0x75/0xa0
 [<ffffffff81083f1b>] ? stop_machine_cpu_stop+0x8b/0xd0
 [<ffffffff81083e90>] ? cpu_stopper_thread+0x190/0x190
 [<ffffffff81083dd0>] ? cpu_stopper_thread+0xd0/0x190
 [<ffffffff815c0870>] ? schedule+0x270/0x6c0
 [<ffffffff81083d00>] ? copy_pid_ns+0x2a0/0x2a0
 [<ffffffff81065846>] ? kthread+0x96/0xa0
 [<ffffffff815c4024>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff815c3436>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff815c2be1>] ? retint_restore_args+0x5/0x6
 [<ffffffff815c4020>] ? gs_change+0x13/0x13
Code: e8 f2 e9 ff ff 8b 44 24 10 44 89 e6 89 c7 e8 64 e8 ff ff ff c3 83 fb 04 0f 84 95 fe ff ff 4a 8b 14 f5 20 95 85 81 e9 68 ff ff ff <0f> 0b eb fe 0f 0b eb fe 48 8b 1d fd 00 42 00 4c 8d 6c 24 20 eb
RIP [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
 RSP <ffff88001f9fbce0>
---[ end trace 82e2e97d58b5f835 ]---

> And here are the new versions of /var/log/xen

root@xenturio1:/var/log/xen# cat xl-thiswillfail.log
Waiting for domain thiswillfail (domid 5) to die [pid 6723]
Domain 5 is dead
Done. Exiting now

>target server's /var/log/xen remains empty

And that was 3.0.4-xenU; the same goes for 2.6.39-xenU.

> Please can you provide full logs from /var/log/xen on both ends. Running
> "xl -vvv migrate" will also produce more stuff on stdout, some of which
> may be useful.
>
> Also please capture the complete guest log in case it is an issue there.

I am not quite sure what you mean by "guest log".

When you reply to this I should be much quicker to respond; I had a hell of a week and didn't really get to check my list mail until yesterday evening.

I guess anyone with two machines running Xen should easily be able to reproduce this problem.
On Fri, 2011-09-23 at 08:39 +0100, Andreas Olsowski wrote:
> Here is the full procedure:
[...]

Thanks, I should be able to figure out a repro with this, although I may not get to it straight away.

> root@xenturio1:/usr/src/linux-2.6-xen# xl console thiswillfail
> PM: freeze of devices complete after 0.207 msecs
> PM: late freeze of devices complete after 0.058 msecs
> ------------[ cut here ]------------
> kernel BUG at drivers/xen/events.c:1466!
> invalid opcode: 0000 [#1] SMP
> CPU 0
> Modules linked in:
>
> Pid: 6, comm: migration/0 Not tainted 3.0.4-xenU #6
> RIP: e030:[<ffffffff8140d574>] [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
> RSP: e02b:ffff88001f9fbce0 EFLAGS: 00010082
> RAX: ffffffffffffffef RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ffff88001f809ea8 RSI: ffff88001f9fbd00 RDI: 0000000000000001
> RBP: 0000000000000010 R08: ffffffff81859a00 R09: 0000000000000000
> R10: 0000000000000000 R11: 09f911029d74e35b R12: 0000000000000000
> R13: 000000000000f0a0 R14: 0000000000000000 R15: ffff88001f9fbd00
> FS: 00007ff28f8c8700(0000) GS:ffff88001fec6000(0000) knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fff02056048 CR3: 000000001e4d8000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process migration/0 (pid: 6, threadinfo ffff88001f9fa000, task ffff88001f9f7170)
> Stack:
>  ffff88001f9fbd34 ffff88001f9fbd54 0000000000000003 000000000000f100
>  0000000000000000 0000000000000003 0000000000000000 0000000000000003
>  ffff88001fa6ddb0 ffffffff8140aa20 ffffffff81859a08 0000000000000000
> Call Trace:
>  [<ffffffff8140aa20>] ? gnttab_map+0x100/0x130
>  [<ffffffff815c2765>] ? _raw_spin_lock+0x5/0x10
>  [<ffffffff81083e01>] ? cpu_stopper_thread+0x101/0x190
>  [<ffffffff8140e1f5>] ? xen_suspend+0x75/0xa0
>  [<ffffffff81083f1b>] ? stop_machine_cpu_stop+0x8b/0xd0
>  [<ffffffff81083e90>] ? cpu_stopper_thread+0x190/0x190
>  [<ffffffff81083dd0>] ? cpu_stopper_thread+0xd0/0x190
>  [<ffffffff815c0870>] ? schedule+0x270/0x6c0
>  [<ffffffff81083d00>] ? copy_pid_ns+0x2a0/0x2a0
>  [<ffffffff81065846>] ? kthread+0x96/0xa0
>  [<ffffffff815c4024>] ? kernel_thread_helper+0x4/0x10
>  [<ffffffff815c3436>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff815c2be1>] ? retint_restore_args+0x5/0x6
>  [<ffffffff815c4020>] ? gs_change+0x13/0x13
> Code: e8 f2 e9 ff ff 8b 44 24 10 44 89 e6 89 c7 e8 64 e8 ff ff ff c3 83 fb 04 0f 84 95 fe ff ff 4a 8b 14 f5 20 95 85 81 e9 68 ff ff ff <0f> 0b eb fe 0f 0b eb fe 48 8b 1d fd 00 42 00 4c 8d 6c 24 20 eb
> RIP [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
>  RSP <ffff88001f9fbce0>
> ---[ end trace 82e2e97d58b5f835 ]---

This seems to be taking the non-cancelled resume path; does this patch help at all:

diff -r d7b14b76f1eb tools/libxl/libxl.c
--- a/tools/libxl/libxl.c	Thu Sep 22 14:26:08 2011 +0100
+++ b/tools/libxl/libxl.c	Fri Sep 23 08:45:28 2011 +0100
@@ -246,7 +246,7 @@ int libxl_domain_resume(libxl_ctx *ctx,
         rc = ERROR_NI;
         goto out;
     }
-    if (xc_domain_resume(ctx->xch, domid, 0)) {
+    if (xc_domain_resume(ctx->xch, domid, 1)) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                          "xc_domain_resume failed for domain %u",
                          domid);

I don't think that's a solution, but if this patch works then it may indicate a problem with xc_domain_resume_any.

[...]

> I am not quite sure what you mean by "guest log".

The guest console log, which you provided above, thanks.

Ian.
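For readers following the thread: the third argument to xc_domain_resume() selects between two rather different resume strategies. The sketch below paraphrases my understanding of tools/libxc/xc_resume.c from roughly this era; the helper names and comments are approximations, not the authoritative source, and should be checked against the tree.

    /* Simplified sketch of the dispatch inside xc_domain_resume(); the real
     * helpers are static functions inside tools/libxc/xc_resume.c. */
    #include <stdint.h>
    #include <xenctrl.h>   /* xc_interface, xc_domain_resume() declaration */

    static int xc_domain_resume_cooperative(xc_interface *xch, uint32_t domid);
    static int xc_domain_resume_any(xc_interface *xch, uint32_t domid);

    int xc_domain_resume(xc_interface *xch, uint32_t domid, int fast)
    {
        if (fast) {
            /* Cooperative resume: make the guest's suspend hypercall return
             * "suspend cancelled" and issue XEN_DOMCTL_resumedomain.  The
             * guest keeps its existing event channels and grant mappings, so
             * nothing needs rebinding.  Requires a guest that understands
             * cancelled suspends. */
            return xc_domain_resume_cooperative(xch, domid);
        } else {
            /* Non-cooperative ("any") resume: leave the suspend hypercall
             * returning 0 and rewrite start info / store / console details so
             * the guest behaves as if restored into a new domain and rebuilds
             * its event channels from scratch.  If the old bindings are in
             * fact still present, that rebuild can fail, which would be
             * consistent with the BUG at drivers/xen/events.c:1466 above. */
            return xc_domain_resume_any(xch, domid);
        }
    }

If that reading is right, flipping the flag to 1 only works here because this particular guest kernel copes with a cancelled suspend; it does not by itself explain what is wrong with the non-cooperative path.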
Andreas Olsowski
2011-Sep-23 09:15 UTC
Re: [Xen-devel] pv guests die after failed migration
On 09/23/2011 09:47 AM, Ian Campbell wrote:
> This seems to be taking the non-cancelled resume path; does this patch
> help at all:
>
> diff -r d7b14b76f1eb tools/libxl/libxl.c
> --- a/tools/libxl/libxl.c	Thu Sep 22 14:26:08 2011 +0100
> +++ b/tools/libxl/libxl.c	Fri Sep 23 08:45:28 2011 +0100
> @@ -246,7 +246,7 @@ int libxl_domain_resume(libxl_ctx *ctx,
>          rc = ERROR_NI;
>          goto out;
>      }
> -    if (xc_domain_resume(ctx->xch, domid, 0)) {
> +    if (xc_domain_resume(ctx->xch, domid, 1)) {
>          LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
>                           "xc_domain_resume failed for domain %u",
>                           domid);
>
> I don't think that's a solution, but if this patch works then it may
> indicate a problem with xc_domain_resume_any.

For the current xen-4.1-testing.hg the patch had to be modified to apply at a different position:

--- a/tools/libxl/libxl.c	Thu Sep 22 14:26:08 2011 +0100
+++ b/tools/libxl/libxl.c	Fri Sep 23 08:45:28 2011 +0100
@@ -229,7 +229,7 @@
         rc = ERROR_NI;
         goto out;
     }
-    if (xc_domain_resume(ctx->xch, domid, 0)) {
+    if (xc_domain_resume(ctx->xch, domid, 1)) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                          "xc_domain_resume failed for domain %u",
                          domid);

I did a clean/make/install after that; compilation worked fine.

I then tested the migration towards an unsuitable target again, and it did what you thought it could do: the guest resumes correctly at the sender.

###############

root@xenturio1:~# xl -vvv migrate thishopefullywontfail xenturio2
Saving to migration stream new xl format (info 0x0/0x0/407)
migration target: Ready to receive domain.
Loading new save file incoming migration stream (new xl fmt info 0x0/0x0/407)
Savefile contains xl domain config
xc: detail: Had 0 unexplained entries in p2m table
xc: Saving memory: iter 0 (last sent 0 skipped 0): 133120/133120 100%
xc: detail: delta 9529ms, dom0 93%, target 1%, sent 449Mb/s, dirtied 1Mb/s 502 pages
xc: Saving memory: iter 1 (last sent 130592 skipped 480): 133120/133120 100%
xc: detail: delta 37ms, dom0 91%, target 2%, sent 444Mb/s, dirtied 30Mb/s 34 pages
xc: Saving memory: iter 2 (last sent 502 skipped 0): 133120/133120 100%
xc: detail: Start last iteration
libxl: debug: libxl_dom.c:384:libxl__domain_suspend_common_callback issuing PV suspend request via XenBus control node
libxl: debug: libxl_dom.c:389:libxl__domain_suspend_common_callback wait for the guest to acknowledge suspend request
libxl: debug: libxl_dom.c:434:libxl__domain_suspend_common_callback guest acknowledged suspend request
libxl: debug: libxl_dom.c:438:libxl__domain_suspend_common_callback wait for the guest to suspend
libxl: debug: libxl_dom.c:450:libxl__domain_suspend_common_callback guest has suspended
xc: detail: SUSPEND shinfo 0007fafc
xc: detail: delta 204ms, dom0 3%, target 0%, sent 4Mb/s, dirtied 25Mb/s 156 pages
xc: Saving memory: iter 3 (last sent 30 skipped 4): 133120/133120 100%
xc: detail: delta 3ms, dom0 0%, target 0%, sent 1703Mb/s, dirtied 1703Mb/s 156 pages
xc: detail: Total pages sent= 131280 (0.99x)
xc: detail: (of which 0 were fixups)
xc: detail: All memory is saved
xc: detail: Save exit rc=0
libxl: error: libxl.c:900:validate_virtual_disk failed to stat /dev/xen-data/thishopefullywontfail-root: No such file or directory
cannot add disk 0 to domain: -6
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:408:libxl_read_exactly file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:72:libxl_report_child_exitstatus migration target process [16608] exited with error status 3
Migration failed, resuming at sender.

root@xenturio1:~# xl console thishopefullywontfail
PM: freeze of devices complete after 0.197 msecs
PM: late freeze of devices complete after 0.067 msecs
PM: early thaw of devices complete after 0.074 msecs
PM: thaw of devices complete after 0.077 msecs
root@thishopefullywontfail:~#

#####################

So that works.

There is no mention of the migration failing in the guest log, though; maybe when a final patch is made it should log that failing migration?

with best regards
andreas
On Fri, 2011-09-23 at 10:15 +0100, Andreas Olsowski wrote:
> On 09/23/2011 09:47 AM, Ian Campbell wrote:
[...]
> I did a clean/make/install after that; compilation worked fine.
>
> I then tested the migration towards an unsuitable target again, and it
> did what you thought it could do: the guest resumes correctly at the sender.
[...]
> So that works.

Great, thanks for testing. I need to figure out how to automatically select which guests are capable of a cooperative resume and which are not.

> There is no mention of the migration failing in the guest log, though;
> maybe when a final patch is made it should log that failing migration?

Are you saying that you don't see the "failed for domain %u" message immediately after the xc_domain_resume call?

+    if (xc_domain_resume(ctx->xch, domid, 1)) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                          "xc_domain_resume failed for domain %u",
                          domid);

That would be pretty odd...

Ian.
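One possible shape for that automatic selection, purely as a sketch and not something that exists in the tree: have the guest advertise cancelled-suspend support through a xenstore feature node and let the toolstack read it before choosing the resume mode. The key name control/feature-suspend-cancel below is an assumption for illustration only; whatever guest kernels actually write would need to be confirmed first.

    /* Sketch: decide whether a guest can handle a cooperative (cancelled)
     * resume by reading a xenstore feature node the guest is assumed to
     * have written.  The key name is hypothetical. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <xs.h>   /* libxenstore */

    static int guest_supports_cancelled_suspend(uint32_t domid)
    {
        struct xs_handle *xsh = xs_daemon_open();
        char path[80];
        unsigned int len;
        char *val;
        int supported = 0;

        if (!xsh)
            return 0;   /* be conservative if xenstore is unreachable */

        snprintf(path, sizeof(path),
                 "/local/domain/%u/control/feature-suspend-cancel",
                 (unsigned int)domid);
        val = xs_read(xsh, XBT_NULL, path, &len);   /* NULL if node absent */
        if (val) {
            supported = (len > 0 && val[0] == '1');
            free(val);
        }
        xs_daemon_close(xsh);
        return supported;
    }

If something along these lines existed, libxl_domain_resume() could pass its result as the third argument to xc_domain_resume() instead of hard-coding 0 or 1.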
Andreas Olsowski
2011-Oct-15 01:18 UTC
Re: [Xen-devel] pv guests die after failed migration
It seems this still has not made it into 4.1-testing.

root@memoryana:~# xl info |grep xen_extra
xen_extra : .2-rc3

root@memoryana:~# xl -vv migrate testmig netcatarina
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/365)
Loading new save file incoming migration stream (new xl fmt info 0x0/0x0/365)
Savefile contains xl domain config
xc: detail: Had 0 unexplained entries in p2m table
xc: Saving memory: iter 0 (last sent 0 skipped 0): 133120/133120 100%
xc: detail: delta 8283ms, dom0 86%, target 0%, sent 516Mb/s, dirtied 2Mb/s 508 pages
xc: Saving memory: iter 1 (last sent 130590 skipped 482): 133120/133120 100%
xc: detail: delta 25ms, dom0 60%, target 0%, sent 665Mb/s, dirtied 44Mb/s 34 pages
xc: Saving memory: iter 2 (last sent 508 skipped 0): 133120/133120 100%
xc: detail: Start last iteration
xc: detail: SUSPEND shinfo 000bee3c
xc: detail: delta 204ms, dom0 3%, target 0%, sent 5Mb/s, dirtied 26Mb/s 162 pages
xc: Saving memory: iter 3 (last sent 34 skipped 0): 133120/133120 100%
xc: detail: delta 1ms, dom0 0%, target 0%, sent 5308Mb/s, dirtied 5308Mb/s 162 pages
xc: detail: Total pages sent= 131294 (0.99x)
xc: detail: (of which 0 were fixups)
xc: detail: All memory is saved
xc: detail: Save exit rc=0
libxl: error: libxl.c:900:validate_virtual_disk failed to stat /dev/xen-data/testmig-root: No such file or directory
cannot add disk 0 to domain: -6
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:408:libxl_read_exactly file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:72:libxl_report_child_exitstatus migration target process [13420] exited with error status 3
Migration failed, resuming at sender.

root@memoryana:~# xl console testmig
PM: freeze of devices complete after 0.099 msecs
PM: late freeze of devices complete after 0.025 msecs
------------[ cut here ]------------
kernel BUG at drivers/xen/events.c:1466!
invalid opcode: 0000 [#1] SMP
CPU 0
Modules linked in:

Pid: 6, comm: migration/0 Not tainted 3.0.4-xenU #6
RIP: e030:[<ffffffff8140d574>] [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
RSP: e02b:ffff88001f9fbce0 EFLAGS: 00010082
RAX: ffffffffffffffef RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88001f809ea8 RSI: ffff88001f9fbd00 RDI: 0000000000000001
RBP: 0000000000000010 R08: ffffffff81859a00 R09: 0000000000000000
R10: 0000000000000000 R11: 09f911029d74e35b R12: 0000000000000000
R13: 000000000000f0a0 R14: 0000000000000000 R15: ffff88001f9fbd00
FS: 00007f49f928b700(0000) GS:ffff88001fec6000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f89fb1a89f0 CR3: 000000001e4cf000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/0 (pid: 6, threadinfo ffff88001f9fa000, task ffff88001f9f7170)
Stack:
 ffff88001f9fbd34 ffff88001f9fbd54 0000000000000003 000000000000f100
 0000000000000000 0000000000000003 0000000000000000 0000000000000003
 ffff88001fa6fdb0 ffffffff8140aa20 ffffffff81859a08 0000000000000000
Call Trace:
 [<ffffffff8140aa20>] ? gnttab_map+0x100/0x130
 [<ffffffff815c2765>] ? _raw_spin_lock+0x5/0x10
 [<ffffffff81083e01>] ? cpu_stopper_thread+0x101/0x190
 [<ffffffff8140e1f5>] ? xen_suspend+0x75/0xa0
 [<ffffffff81083f1b>] ? stop_machine_cpu_stop+0x8b/0xd0
 [<ffffffff81083e90>] ? cpu_stopper_thread+0x190/0x190
 [<ffffffff81083dd0>] ? cpu_stopper_thread+0xd0/0x190
 [<ffffffff815c0870>] ? schedule+0x270/0x6c0
 [<ffffffff81083d00>] ? copy_pid_ns+0x2a0/0x2a0
 [<ffffffff81065846>] ? kthread+0x96/0xa0
 [<ffffffff815c4024>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff815c3436>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff815c2be1>] ? retint_restore_args+0x5/0x6
 [<ffffffff815c4020>] ? gs_change+0x13/0x13
Code: e8 f2 e9 ff ff 8b 44 24 10 44 89 e6 89 c7 e8 64 e8 ff ff ff c3 83 fb 04 0f 84 95 fe ff ff 4a 8b 14 f5 20 95 85 81 e9 68 ff ff ff <0f> 0b eb fe 0f 0b eb fe 48 8b 1d fd 00 42 00 4c 8d 6c 24 20 eb
RIP [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
 RSP <ffff88001f9fbce0>
---[ end trace 67ddba38000aae42 ]---

-- 
Andreas
On Sat, 2011-10-15 at 02:18 +0100, Andreas Olsowski wrote:
> It seems this still has not made it into 4.1-testing.

I'm afraid I've not had time to "figure out how to automatically select which guests are capable of a cooperative resume and which are not." so it hasn't been fixed in xen-unstable either AFAIK.

I'm also still interested in confirmation to the question I asked in the mail you just replied to.

> root@memoryana:~# xl info |grep xen_extra
> xen_extra : .2-rc3
>
> root@memoryana:~# xl -vv migrate testmig netcatarina
> [...]
Andreas Olsowski
2011-Oct-15 10:35 UTC
Re: [Xen-devel] pv guests die after failed migration
On 10/15/2011 07:44 AM, Ian Campbell wrote:
> On Sat, 2011-10-15 at 02:18 +0100, Andreas Olsowski wrote:
>> It seems this still has not made it into 4.1-testing.
>
> I'm afraid I've not had time to "figure out how to automatically select
> which guests are capable of a cooperative resume and which are not." so
> it hasn't been fixed in xen-unstable either AFAIK.

Wouldn't just assuming all of them do fix a bigger percentage of systems than leaving it the way it is?

> I'm also still interested in confirmation to the question I asked in the
> mail you just replied to.

Oh, I thought I already did.

> Are you saying that you don't see the "failed for domain %u" message
> immediately after the xc_domain_resume call?
>
> +    if (xc_domain_resume(ctx->xch, domid, 1)) {
>          LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
>                           "xc_domain_resume failed for domain %u",
>                           domid);
>
> That would be pretty odd...

Yes, that is what I am saying:

root@memoryana:/var/log/xen# cat xl-testmig.log
Waiting for domain testmig (domid 2) to die [pid 13349]
Domain 2 is dead
Done. Exiting now

-- 
Andreas
On Sat, 2011-10-15 at 11:35 +0100, Andreas Olsowski wrote:
> On 10/15/2011 07:44 AM, Ian Campbell wrote:
>> On Sat, 2011-10-15 at 02:18 +0100, Andreas Olsowski wrote:
>>> It seems this still has not made it into 4.1-testing.
>>
>> I'm afraid I've not had time to "figure out how to automatically select
>> which guests are capable of a cooperative resume and which are not." so
>> it hasn't been fixed in xen-unstable either AFAIK.
>
> Wouldn't just assuming all of them do fix a bigger percentage of systems
> than leaving it the way it is?

I don't know -- that's really the point.

>> I'm also still interested in confirmation to the question I asked in the
>> mail you just replied to.
>
> Oh, I thought I already did.
>
>> Are you saying that you don't see the "failed for domain %u" message
>> immediately after the xc_domain_resume call?
>>
>> +    if (xc_domain_resume(ctx->xch, domid, 1)) {
>>          LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
>>                           "xc_domain_resume failed for domain %u",
>>                           domid);
>>
>> That would be pretty odd...
>
> Yes, that is what I am saying:

Oh wait, that's right -- as far as the toolstack is concerned the resume was successful; the subsequent crash is within the guest.

Ian.

> root@memoryana:/var/log/xen# cat xl-testmig.log
> Waiting for domain testmig (domid 2) to die [pid 13349]
> Domain 2 is dead
> Done. Exiting now
Has any progress been made regarding this issue?

> On Sat, 2011-10-15 at 11:35 +0100, Andreas Olsowski wrote:
>> On 10/15/2011 07:44 AM, Ian Campbell wrote:
>>> On Sat, 2011-10-15 at 02:18 +0100, Andreas Olsowski wrote:
>>>> It seems this still has not made it into 4.1-testing.
>>> I'm afraid I've not had time to "figure out how to automatically select
>>> which guests are capable of a cooperative resume and which are not." so
>>> it hasn't been fixed in xen-unstable either AFAIK.
>>>
>> Wouldn't just assuming all of them do fix a bigger percentage of systems
>> than leaving it the way it is?
> I don't know -- that's really the point.
>
>>> I'm also still interested in confirmation to the question I asked in the
>>> mail you just replied to.
>>>
>> Oh, I thought I already did.
>>
>>> Are you saying that you don't see the "failed for domain %u" message
>>> immediately after the xc_domain_resume call?
>>>
>>> +    if (xc_domain_resume(ctx->xch, domid, 1)) {
>>>          LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
>>>                           "xc_domain_resume failed for domain %u",
>>>                           domid);
>>>
>>> That would be pretty odd...
>>
>> Yes, that is what I am saying:
> Oh wait, that's right -- as far as the toolstack is concerned the resume
> was successful; the subsequent crash is within the guest.
>
> Ian.

Andreas