When xm failed to do a live migration, the system was resumed on the sending host. xl does not do that: it tries to, but just crashes the guest:

kernel BUG at drivers/xen/events.c:1466!

(In this example the target host didn't have the logical volume activated.)

Now that can't be right. IMHO xl should do some checking of whether the target is a viable migration target (are the disks and vifs there, is there enough memory available?) and preserve a safe state on the sender until the guest has properly started on the receiver.
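A pre-flight check of the kind suggested here could, in principle, run on the sender before any memory is transferred. The C sketch below only illustrates the idea: the remote_path_exists() and remote_free_memkb() helpers and the guest_cfg structure are hypothetical placeholders, not existing libxl or libxc interfaces.

    /* Sketch only: a pre-flight viability check the sender could run against
     * the migration target before starting the transfer.  The helpers below
     * are hypothetical placeholders, not existing toolstack functions. */
    #include <stdio.h>
    #include <stdbool.h>

    struct guest_cfg {
        const char **disk_paths;     /* e.g. an LVM path like the ones in this thread */
        int num_disks;
        unsigned long memory_kb;     /* guest memory requirement */
    };

    /* Hypothetical: ask the receiving host whether a block device exists. */
    bool remote_path_exists(const char *host, const char *path);
    /* Hypothetical: ask the receiving host how much free memory it has. */
    unsigned long remote_free_memkb(const char *host);

    static int migration_preflight(const char *target, const struct guest_cfg *cfg)
    {
        /* Refuse to start the migration if any backing disk is missing. */
        for (int i = 0; i < cfg->num_disks; i++) {
            if (!remote_path_exists(target, cfg->disk_paths[i])) {
                fprintf(stderr, "preflight: %s missing on %s, not migrating\n",
                        cfg->disk_paths[i], target);
                return -1;
            }
        }
        /* Refuse if the target cannot hold the guest's memory. */
        if (remote_free_memkb(target) < cfg->memory_kb) {
            fprintf(stderr, "preflight: not enough free memory on %s\n", target);
            return -1;
        }
        return 0;   /* looks viable; proceed with the live migration */
    }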
On Tue, 2011-09-20 at 10:42 +0100, Andreas Olsowski wrote:
> When xm failed to do a live migration, the system was resumed on the
> sending host.
> xl does not do that: it tries to, but just crashes the guest:
> kernel BUG at drivers/xen/events.c:1466!
> (In this example the target host didn't have the logical volume activated.)
>
> Now that can't be right.

No, it's a bug, perhaps in the kernel but likely in the toolstack.

Please can you provide full logs from /var/log/xen on both ends. Running "xl -vvv migrate" will also produce more stuff on stdout, some of which may be useful.

Also please capture the complete guest log in case it is an issue there.

> IMHO xl should do some checking of whether the target is a viable migration
> target (are the disks and vifs there, is there enough memory available?)
> and preserve a safe state on the sender until the guest has properly
> started on the receiver.

xl does have some checks and does preserve the sender side until it gets confirmation of correct restart, but obviously something is wrong with the bit which restarts the old guest. I'm sure it worked at one point though.

Thanks,
Ian.
Andreas Olsowski
2011-Sep-23 07:39 UTC
Re: [Xen-devel] pv guests die after failed migration
Here is the full procedure:

Preparations:

root@xenturio1:/var/log/xen# dmsetup ls |grep thiswillfail
xen--data-thiswillfail--swap (252, 236)
xen--data-thiswillfail--root (252, 235)

root@xenturio2:/var/log/xen# dmsetup ls |grep thiswillfail

>Server 2 does not have the logical volumes activated.

root@xenturio1:/usr/src/linux-2.6-xen# xl create /mnt/vmctrl/xenconfig/thiswillfail.sxp
Parsing config file /mnt/vmctrl/xenconfig/thiswillfail.sxp
Daemon running with PID 6722

>it is in fact running with pid 6723:

root@xenturio1:/usr/src/linux-2.6-xen# ps auxww |grep "xl create"
root 6723 0.0 0.0 35616 972 ? Ssl 09:14 0:00 xl create /mnt/vmctrl/xenconfig/thiswillfail.sxp

>Let's check the logfiles

root@xenturio1:/var/log/xen# cat xen-hotplug.log
RTNETLINK answers: Operation not supported
RTNETLINK answers: Operation not supported

>stupid netlink again, no matter what stuff I load into the kernel that
>still pops up ... annoying ... anyway, it's a non-issue in this case

root@xenturio1:/var/log/xen# cat xl-thiswillfail.log
Waiting for domain thiswillfail (domid 5) to die [pid 6723]

>Let's not make it wait any longer ;)

root@xenturio1:/usr/src/linux-2.6-xen# xl -vvv migrate thiswillfail xenturio2
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/380)
Loading new save file incoming migration stream (new xl fmt info 0x0/0x0/380)
Savefile contains xl domain config
xc: detail: Had 0 unexplained entries in p2m table
xc: Saving memory: iter 0 (last sent 0 skipped 0): 133120/133120 100%
xc: detail: delta 9499ms, dom0 88%, target 2%, sent 451Mb/s, dirtied 1Mb/s 324 pages
xc: Saving memory: iter 1 (last sent 130760 skipped 312): 133120/133120 100%
xc: detail: delta 23ms, dom0 91%, target 0%, sent 455Mb/s, dirtied 48Mb/s 34 pages
xc: Saving memory: iter 2 (last sent 320 skipped 4): 133120/133120 100%
xc: detail: Start last iteration
libxl: debug: libxl_dom.c:384:libxl__domain_suspend_common_callback issuing PV suspend request via XenBus control node
libxl: debug: libxl_dom.c:389:libxl__domain_suspend_common_callback wait for the guest to acknowledge suspend request
libxl: debug: libxl_dom.c:434:libxl__domain_suspend_common_callback guest acknowledged suspend request
libxl: debug: libxl_dom.c:438:libxl__domain_suspend_common_callback wait for the guest to suspend
libxl: debug: libxl_dom.c:450:libxl__domain_suspend_common_callback guest has suspended
xc: detail: SUSPEND shinfo 0007fafc
xc: detail: delta 206ms, dom0 2%, target 0%, sent 4Mb/s, dirtied 24Mb/s 154 pages
xc: Saving memory: iter 3 (last sent 30 skipped 4): 133120/133120 100%
xc: detail: delta 3ms, dom0 0%, target 0%, sent 1682Mb/s, dirtied 1682Mb/s 154 pages
xc: detail: Total pages sent= 131264 (0.99x)
xc: detail: (of which 0 were fixups)
xc: detail: All memory is saved
xc: detail: Save exit rc=0
libxl: error: libxl.c:900:validate_virtual_disk failed to stat /dev/xen-data/thiswillfail-root: No such file or directory
cannot add disk 0 to domain: -6
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:408:libxl_read_exactly file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:72:libxl_report_child_exitstatus migration target process [6837] exited with error status 3
Migration failed, resuming at sender.

>Now see if it really is resumed at sender:

root@xenturio1:/usr/src/linux-2.6-xen# xl console thiswillfail
PM: freeze of devices complete after 0.207 msecs
PM: late freeze of devices complete after 0.058 msecs
------------[ cut here ]------------
kernel BUG at drivers/xen/events.c:1466!
invalid opcode: 0000 [#1] SMP
CPU 0
Modules linked in:

Pid: 6, comm: migration/0 Not tainted 3.0.4-xenU #6
RIP: e030:[<ffffffff8140d574>] [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
RSP: e02b:ffff88001f9fbce0 EFLAGS: 00010082
RAX: ffffffffffffffef RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88001f809ea8 RSI: ffff88001f9fbd00 RDI: 0000000000000001
RBP: 0000000000000010 R08: ffffffff81859a00 R09: 0000000000000000
R10: 0000000000000000 R11: 09f911029d74e35b R12: 0000000000000000
R13: 000000000000f0a0 R14: 0000000000000000 R15: ffff88001f9fbd00
FS: 00007ff28f8c8700(0000) GS:ffff88001fec6000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff02056048 CR3: 000000001e4d8000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/0 (pid: 6, threadinfo ffff88001f9fa000, task ffff88001f9f7170)
Stack:
 ffff88001f9fbd34 ffff88001f9fbd54 0000000000000003 000000000000f100
 0000000000000000 0000000000000003 0000000000000000 0000000000000003
 ffff88001fa6ddb0 ffffffff8140aa20 ffffffff81859a08 0000000000000000
Call Trace:
 [<ffffffff8140aa20>] ? gnttab_map+0x100/0x130
 [<ffffffff815c2765>] ? _raw_spin_lock+0x5/0x10
 [<ffffffff81083e01>] ? cpu_stopper_thread+0x101/0x190
 [<ffffffff8140e1f5>] ? xen_suspend+0x75/0xa0
 [<ffffffff81083f1b>] ? stop_machine_cpu_stop+0x8b/0xd0
 [<ffffffff81083e90>] ? cpu_stopper_thread+0x190/0x190
 [<ffffffff81083dd0>] ? cpu_stopper_thread+0xd0/0x190
 [<ffffffff815c0870>] ? schedule+0x270/0x6c0
 [<ffffffff81083d00>] ? copy_pid_ns+0x2a0/0x2a0
 [<ffffffff81065846>] ? kthread+0x96/0xa0
 [<ffffffff815c4024>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff815c3436>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff815c2be1>] ? retint_restore_args+0x5/0x6
 [<ffffffff815c4020>] ? gs_change+0x13/0x13
Code: e8 f2 e9 ff ff 8b 44 24 10 44 89 e6 89 c7 e8 64 e8 ff ff ff c3 83 fb 04 0f 84 95 fe ff ff 4a 8b 14 f5 20 95 85 81 e9 68 ff ff ff <0f> 0b eb fe 0f 0b eb fe 48 8b 1d fd 00 42 00 4c 8d 6c 24 20 eb
RIP [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
 RSP <ffff88001f9fbce0>
---[ end trace 82e2e97d58b5f835 ]---

> And here are the new versions of /var/log/xen

root@xenturio1:/var/log/xen# cat xl-thiswillfail.log
Waiting for domain thiswillfail (domid 5) to die [pid 6723]
Domain 5 is dead
Done. Exiting now

>target server's /var/log/xen remains empty

And that was 3.0.4-xenU; the same goes for 2.6.39-xenU.

> Please can you provide full logs from /var/log/xen on both ends. Running
> "xl -vvv migrate" will also produce more stuff on stdout, some of which
> may be useful.
>
> Also please capture the complete guest log in case it is an issue there.

I am not quite sure what you mean by "guest log".

When you reply to this I should be much quicker to respond; I had a hell of a week and didn't really get to check my list mail until yesterday evening.

I guess anyone with two machines running Xen should easily be able to reproduce this problem.
On Fri, 2011-09-23 at 08:39 +0100, Andreas Olsowski wrote:
> Here is the full procedure:
[...]

Thanks, I should be able to figure out a repro with this, although I may not get to it straight away.

> root@xenturio1:/usr/src/linux-2.6-xen# xl console thiswillfail
> PM: freeze of devices complete after 0.207 msecs
> PM: late freeze of devices complete after 0.058 msecs
> ------------[ cut here ]------------
> kernel BUG at drivers/xen/events.c:1466!
> invalid opcode: 0000 [#1] SMP
> CPU 0
> Modules linked in:
>
> Pid: 6, comm: migration/0 Not tainted 3.0.4-xenU #6
> RIP: e030:[<ffffffff8140d574>] [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
> RSP: e02b:ffff88001f9fbce0 EFLAGS: 00010082
> RAX: ffffffffffffffef RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ffff88001f809ea8 RSI: ffff88001f9fbd00 RDI: 0000000000000001
> RBP: 0000000000000010 R08: ffffffff81859a00 R09: 0000000000000000
> R10: 0000000000000000 R11: 09f911029d74e35b R12: 0000000000000000
> R13: 000000000000f0a0 R14: 0000000000000000 R15: ffff88001f9fbd00
> FS: 00007ff28f8c8700(0000) GS:ffff88001fec6000(0000) knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fff02056048 CR3: 000000001e4d8000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process migration/0 (pid: 6, threadinfo ffff88001f9fa000, task ffff88001f9f7170)
> Stack:
>  ffff88001f9fbd34 ffff88001f9fbd54 0000000000000003 000000000000f100
>  0000000000000000 0000000000000003 0000000000000000 0000000000000003
>  ffff88001fa6ddb0 ffffffff8140aa20 ffffffff81859a08 0000000000000000
> Call Trace:
>  [<ffffffff8140aa20>] ? gnttab_map+0x100/0x130
>  [<ffffffff815c2765>] ? _raw_spin_lock+0x5/0x10
>  [<ffffffff81083e01>] ? cpu_stopper_thread+0x101/0x190
>  [<ffffffff8140e1f5>] ? xen_suspend+0x75/0xa0
>  [<ffffffff81083f1b>] ? stop_machine_cpu_stop+0x8b/0xd0
>  [<ffffffff81083e90>] ? cpu_stopper_thread+0x190/0x190
>  [<ffffffff81083dd0>] ? cpu_stopper_thread+0xd0/0x190
>  [<ffffffff815c0870>] ? schedule+0x270/0x6c0
>  [<ffffffff81083d00>] ? copy_pid_ns+0x2a0/0x2a0
>  [<ffffffff81065846>] ? kthread+0x96/0xa0
>  [<ffffffff815c4024>] ? kernel_thread_helper+0x4/0x10
>  [<ffffffff815c3436>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff815c2be1>] ? retint_restore_args+0x5/0x6
>  [<ffffffff815c4020>] ? gs_change+0x13/0x13
> Code: e8 f2 e9 ff ff 8b 44 24 10 44 89 e6 89 c7 e8 64 e8 ff ff ff c3 83 fb 04 0f 84 95 fe ff ff 4a 8b 14 f5 20 95 85 81 e9 68 ff ff ff <0f> 0b eb fe 0f 0b eb fe 48 8b 1d fd 00 42 00 4c 8d 6c 24 20 eb
> RIP [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
>  RSP <ffff88001f9fbce0>
> ---[ end trace 82e2e97d58b5f835 ]---

This seems to be taking the non-cancelled resume path; does this patch help at all:

diff -r d7b14b76f1eb tools/libxl/libxl.c
--- a/tools/libxl/libxl.c	Thu Sep 22 14:26:08 2011 +0100
+++ b/tools/libxl/libxl.c	Fri Sep 23 08:45:28 2011 +0100
@@ -246,7 +246,7 @@ int libxl_domain_resume(libxl_ctx *ctx,
         rc = ERROR_NI;
         goto out;
     }
-    if (xc_domain_resume(ctx->xch, domid, 0)) {
+    if (xc_domain_resume(ctx->xch, domid, 1)) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                          "xc_domain_resume failed for domain %u",
                          domid);

I don't think that's a solution, but if this patch works then it may indicate a problem with xc_domain_resume_any.

[...]

> I am not quite sure what you mean by "guest log".

The guest console log, which you provided above, thanks.

Ian.
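For readers following the thread: the third argument to xc_domain_resume() selects between two rather different resume strategies. The sketch below paraphrases my understanding of tools/libxc/xc_resume.c from roughly this era; the helper names and comments are approximations, not the authoritative source, and should be checked against the tree.

    /* Simplified sketch of the dispatch inside xc_domain_resume(); the real
     * helpers are static functions inside tools/libxc/xc_resume.c. */
    #include <stdint.h>
    #include <xenctrl.h>   /* xc_interface, xc_domain_resume() declaration */

    static int xc_domain_resume_cooperative(xc_interface *xch, uint32_t domid);
    static int xc_domain_resume_any(xc_interface *xch, uint32_t domid);

    int xc_domain_resume(xc_interface *xch, uint32_t domid, int fast)
    {
        if (fast) {
            /* Cooperative resume: make the guest's suspend hypercall return
             * "suspend cancelled" and issue XEN_DOMCTL_resumedomain.  The
             * guest keeps its existing event channels and grant mappings, so
             * nothing needs rebinding.  Requires a guest that understands
             * cancelled suspends. */
            return xc_domain_resume_cooperative(xch, domid);
        } else {
            /* Non-cooperative ("any") resume: leave the suspend hypercall
             * returning 0 and rewrite start info / store / console details so
             * the guest behaves as if restored into a new domain and rebuilds
             * its event channels from scratch.  If the old bindings are in
             * fact still present, that rebuild can fail, which would be
             * consistent with the BUG at drivers/xen/events.c:1466 above. */
            return xc_domain_resume_any(xch, domid);
        }
    }

If that reading is right, flipping the flag to 1 only works here because this particular guest kernel copes with a cancelled suspend; it does not by itself explain what is wrong with the non-cooperative path.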
Andreas Olsowski
2011-Sep-23 09:15 UTC
Re: [Xen-devel] pv guests die after failed migration
On 09/23/2011 09:47 AM, Ian Campbell wrote:
> This seems to be taking the non-cancelled resume path; does this patch
> help at all:
>
> diff -r d7b14b76f1eb tools/libxl/libxl.c
> --- a/tools/libxl/libxl.c	Thu Sep 22 14:26:08 2011 +0100
> +++ b/tools/libxl/libxl.c	Fri Sep 23 08:45:28 2011 +0100
> @@ -246,7 +246,7 @@ int libxl_domain_resume(libxl_ctx *ctx,
>          rc = ERROR_NI;
>          goto out;
>      }
> -    if (xc_domain_resume(ctx->xch, domid, 0)) {
> +    if (xc_domain_resume(ctx->xch, domid, 1)) {
>          LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
>                           "xc_domain_resume failed for domain %u",
>                           domid);
>
> I don't think that's a solution, but if this patch works then it may
> indicate a problem with xc_domain_resume_any.

For the current xen-4.1-testing.hg the patch had to be modified to apply at a different position:

--- a/tools/libxl/libxl.c	Thu Sep 22 14:26:08 2011 +0100
+++ b/tools/libxl/libxl.c	Fri Sep 23 08:45:28 2011 +0100
@@ -229,7 +229,7 @@
         rc = ERROR_NI;
         goto out;
     }
-    if (xc_domain_resume(ctx->xch, domid, 0)) {
+    if (xc_domain_resume(ctx->xch, domid, 1)) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                          "xc_domain_resume failed for domain %u",
                          domid);

I did a clean/make/install after that; compilation worked fine.

I then tested the migration towards an unsuitable target again, and it did what you thought it could do: the guest resumes correctly at the sender.

###############

root@xenturio1:~# xl -vvv migrate thishopefullywontfail xenturio2
Saving to migration stream new xl format (info 0x0/0x0/407)
migration target: Ready to receive domain.
Loading new save file incoming migration stream (new xl fmt info 0x0/0x0/407)
Savefile contains xl domain config
xc: detail: Had 0 unexplained entries in p2m table
xc: Saving memory: iter 0 (last sent 0 skipped 0): 133120/133120 100%
xc: detail: delta 9529ms, dom0 93%, target 1%, sent 449Mb/s, dirtied 1Mb/s 502 pages
xc: Saving memory: iter 1 (last sent 130592 skipped 480): 133120/133120 100%
xc: detail: delta 37ms, dom0 91%, target 2%, sent 444Mb/s, dirtied 30Mb/s 34 pages
xc: Saving memory: iter 2 (last sent 502 skipped 0): 133120/133120 100%
xc: detail: Start last iteration
libxl: debug: libxl_dom.c:384:libxl__domain_suspend_common_callback issuing PV suspend request via XenBus control node
libxl: debug: libxl_dom.c:389:libxl__domain_suspend_common_callback wait for the guest to acknowledge suspend request
libxl: debug: libxl_dom.c:434:libxl__domain_suspend_common_callback guest acknowledged suspend request
libxl: debug: libxl_dom.c:438:libxl__domain_suspend_common_callback wait for the guest to suspend
libxl: debug: libxl_dom.c:450:libxl__domain_suspend_common_callback guest has suspended
xc: detail: SUSPEND shinfo 0007fafc
xc: detail: delta 204ms, dom0 3%, target 0%, sent 4Mb/s, dirtied 25Mb/s 156 pages
xc: Saving memory: iter 3 (last sent 30 skipped 4): 133120/133120 100%
xc: detail: delta 3ms, dom0 0%, target 0%, sent 1703Mb/s, dirtied 1703Mb/s 156 pages
xc: detail: Total pages sent= 131280 (0.99x)
xc: detail: (of which 0 were fixups)
xc: detail: All memory is saved
xc: detail: Save exit rc=0
libxl: error: libxl.c:900:validate_virtual_disk failed to stat /dev/xen-data/thishopefullywontfail-root: No such file or directory
cannot add disk 0 to domain: -6
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:408:libxl_read_exactly file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:72:libxl_report_child_exitstatus migration target process [16608] exited with error status 3
Migration failed, resuming at sender.

root@xenturio1:~# xl console thishopefullywontfail
PM: freeze of devices complete after 0.197 msecs
PM: late freeze of devices complete after 0.067 msecs
PM: early thaw of devices complete after 0.074 msecs
PM: thaw of devices complete after 0.077 msecs
root@thishopefullywontfail:~#

#####################

So that works.

There is no mention of the migration failing in the guest log, though; maybe when a final patch is made it should log that failing migration?

with best regards
andreas
On Fri, 2011-09-23 at 10:15 +0100, Andreas Olsowski wrote:
> On 09/23/2011 09:47 AM, Ian Campbell wrote:
[...]
> I did a clean/make/install after that; compilation worked fine.
>
> I then tested the migration towards an unsuitable target again, and it
> did what you thought it could do: the guest resumes correctly at the sender.
[...]
> So that works.

Great, thanks for testing. I need to figure out how to automatically select which guests are capable of a cooperative resume and which are not.

> There is no mention of the migration failing in the guest log, though;
> maybe when a final patch is made it should log that failing migration?

Are you saying that you don't see the "failed for domain %u" message immediately after the xc_domain_resume call?

+    if (xc_domain_resume(ctx->xch, domid, 1)) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                          "xc_domain_resume failed for domain %u",
                          domid);

That would be pretty odd...

Ian.
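One possible shape for that automatic selection, purely as a sketch and not something that exists in the tree: have the guest advertise cancelled-suspend support through a xenstore feature node and let the toolstack read it before choosing the resume mode. The key name control/feature-suspend-cancel below is an assumption for illustration only; whatever guest kernels actually write would need to be confirmed first.

    /* Sketch: decide whether a guest can handle a cooperative (cancelled)
     * resume by reading a xenstore feature node the guest is assumed to
     * have written.  The key name is hypothetical. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <xs.h>   /* libxenstore */

    static int guest_supports_cancelled_suspend(uint32_t domid)
    {
        struct xs_handle *xsh = xs_daemon_open();
        char path[80];
        unsigned int len;
        char *val;
        int supported = 0;

        if (!xsh)
            return 0;   /* be conservative if xenstore is unreachable */

        snprintf(path, sizeof(path),
                 "/local/domain/%u/control/feature-suspend-cancel",
                 (unsigned int)domid);
        val = xs_read(xsh, XBT_NULL, path, &len);   /* NULL if node absent */
        if (val) {
            supported = (len > 0 && val[0] == '1');
            free(val);
        }
        xs_daemon_close(xsh);
        return supported;
    }

If something along these lines existed, libxl_domain_resume() could pass its result as the third argument to xc_domain_resume() instead of hard-coding 0 or 1.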
Andreas Olsowski
2011-Oct-15 01:18 UTC
Re: [Xen-devel] pv guests die after failed migration
It seems this still has not made it into 4.1-testing.

root@memoryana:~# xl info |grep xen_extra
xen_extra : .2-rc3

root@memoryana:~# xl -vv migrate testmig netcatarina
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/365)
Loading new save file incoming migration stream (new xl fmt info 0x0/0x0/365)
Savefile contains xl domain config
xc: detail: Had 0 unexplained entries in p2m table
xc: Saving memory: iter 0 (last sent 0 skipped 0): 133120/133120 100%
xc: detail: delta 8283ms, dom0 86%, target 0%, sent 516Mb/s, dirtied 2Mb/s 508 pages
xc: Saving memory: iter 1 (last sent 130590 skipped 482): 133120/133120 100%
xc: detail: delta 25ms, dom0 60%, target 0%, sent 665Mb/s, dirtied 44Mb/s 34 pages
xc: Saving memory: iter 2 (last sent 508 skipped 0): 133120/133120 100%
xc: detail: Start last iteration
xc: detail: SUSPEND shinfo 000bee3c
xc: detail: delta 204ms, dom0 3%, target 0%, sent 5Mb/s, dirtied 26Mb/s 162 pages
xc: Saving memory: iter 3 (last sent 34 skipped 0): 133120/133120 100%
xc: detail: delta 1ms, dom0 0%, target 0%, sent 5308Mb/s, dirtied 5308Mb/s 162 pages
xc: detail: Total pages sent= 131294 (0.99x)
xc: detail: (of which 0 were fixups)
xc: detail: All memory is saved
xc: detail: Save exit rc=0
libxl: error: libxl.c:900:validate_virtual_disk failed to stat /dev/xen-data/testmig-root: No such file or directory
cannot add disk 0 to domain: -6
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:408:libxl_read_exactly file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:72:libxl_report_child_exitstatus migration target process [13420] exited with error status 3
Migration failed, resuming at sender.

root@memoryana:~# xl console testmig
PM: freeze of devices complete after 0.099 msecs
PM: late freeze of devices complete after 0.025 msecs
------------[ cut here ]------------
kernel BUG at drivers/xen/events.c:1466!
invalid opcode: 0000 [#1] SMP
CPU 0
Modules linked in:

Pid: 6, comm: migration/0 Not tainted 3.0.4-xenU #6
RIP: e030:[<ffffffff8140d574>] [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
RSP: e02b:ffff88001f9fbce0 EFLAGS: 00010082
RAX: ffffffffffffffef RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88001f809ea8 RSI: ffff88001f9fbd00 RDI: 0000000000000001
RBP: 0000000000000010 R08: ffffffff81859a00 R09: 0000000000000000
R10: 0000000000000000 R11: 09f911029d74e35b R12: 0000000000000000
R13: 000000000000f0a0 R14: 0000000000000000 R15: ffff88001f9fbd00
FS: 00007f49f928b700(0000) GS:ffff88001fec6000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f89fb1a89f0 CR3: 000000001e4cf000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/0 (pid: 6, threadinfo ffff88001f9fa000, task ffff88001f9f7170)
Stack:
 ffff88001f9fbd34 ffff88001f9fbd54 0000000000000003 000000000000f100
 0000000000000000 0000000000000003 0000000000000000 0000000000000003
 ffff88001fa6fdb0 ffffffff8140aa20 ffffffff81859a08 0000000000000000
Call Trace:
 [<ffffffff8140aa20>] ? gnttab_map+0x100/0x130
 [<ffffffff815c2765>] ? _raw_spin_lock+0x5/0x10
 [<ffffffff81083e01>] ? cpu_stopper_thread+0x101/0x190
 [<ffffffff8140e1f5>] ? xen_suspend+0x75/0xa0
 [<ffffffff81083f1b>] ? stop_machine_cpu_stop+0x8b/0xd0
 [<ffffffff81083e90>] ? cpu_stopper_thread+0x190/0x190
 [<ffffffff81083dd0>] ? cpu_stopper_thread+0xd0/0x190
 [<ffffffff815c0870>] ? schedule+0x270/0x6c0
 [<ffffffff81083d00>] ? copy_pid_ns+0x2a0/0x2a0
 [<ffffffff81065846>] ? kthread+0x96/0xa0
 [<ffffffff815c4024>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff815c3436>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff815c2be1>] ? retint_restore_args+0x5/0x6
 [<ffffffff815c4020>] ? gs_change+0x13/0x13
Code: e8 f2 e9 ff ff 8b 44 24 10 44 89 e6 89 c7 e8 64 e8 ff ff ff c3 83 fb 04 0f 84 95 fe ff ff 4a 8b 14 f5 20 95 85 81 e9 68 ff ff ff <0f> 0b eb fe 0f 0b eb fe 48 8b 1d fd 00 42 00 4c 8d 6c 24 20 eb
RIP [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
 RSP <ffff88001f9fbce0>
---[ end trace 67ddba38000aae42 ]---

-- 
Andreas
On Sat, 2011-10-15 at 02:18 +0100, Andreas Olsowski wrote:
> It seems this still has not made it into 4.1-testing.

I'm afraid I've not had time to "figure out how to automatically select which guests are capable of a cooperative resume and which are not." so it hasn't been fixed in xen-unstable either AFAIK.

I'm also still interested in confirmation to the question I asked in the mail you just replied to.

> root@memoryana:~# xl info |grep xen_extra
> xen_extra : .2-rc3
>
> root@memoryana:~# xl -vv migrate testmig netcatarina
> [...]
Andreas Olsowski
2011-Oct-15 10:35 UTC
Re: [Xen-devel] pv guests die after failed migration
On 10/15/2011 07:44 AM, Ian Campbell wrote:
> On Sat, 2011-10-15 at 02:18 +0100, Andreas Olsowski wrote:
>> It seems this still has not made it into 4.1-testing.
>
> I'm afraid I've not had time to "figure out how to automatically select
> which guests are capable of a cooperative resume and which are not." so
> it hasn't been fixed in xen-unstable either AFAIK.

Wouldn't just assuming all of them do fix a bigger percentage of systems than leaving it the way it is?

> I'm also still interested in confirmation to the question I asked in the
> mail you just replied to.

Oh, I thought I already did.

> Are you saying that you don't see the "failed for domain %u" message
> immediately after the xc_domain_resume call?
>
> +    if (xc_domain_resume(ctx->xch, domid, 1)) {
>          LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
>                           "xc_domain_resume failed for domain %u",
>                           domid);
>
> That would be pretty odd...

Yes, that is what I am saying:

root@memoryana:/var/log/xen# cat xl-testmig.log
Waiting for domain testmig (domid 2) to die [pid 13349]
Domain 2 is dead
Done. Exiting now

-- 
Andreas
On Sat, 2011-10-15 at 11:35 +0100, Andreas Olsowski wrote:
> On 10/15/2011 07:44 AM, Ian Campbell wrote:
>> On Sat, 2011-10-15 at 02:18 +0100, Andreas Olsowski wrote:
>>> It seems this still has not made it into 4.1-testing.
>>
>> I'm afraid I've not had time to "figure out how to automatically select
>> which guests are capable of a cooperative resume and which are not." so
>> it hasn't been fixed in xen-unstable either AFAIK.
>
> Wouldn't just assuming all of them do fix a bigger percentage of systems
> than leaving it the way it is?

I don't know -- that's really the point.

>> I'm also still interested in confirmation to the question I asked in the
>> mail you just replied to.
>
> Oh, I thought I already did.
>
>> Are you saying that you don't see the "failed for domain %u" message
>> immediately after the xc_domain_resume call?
>>
>> +    if (xc_domain_resume(ctx->xch, domid, 1)) {
>>          LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
>>                           "xc_domain_resume failed for domain %u",
>>                           domid);
>>
>> That would be pretty odd...
>
> Yes, that is what I am saying:

Oh wait, that's right -- as far as the toolstack is concerned the resume was successful; the subsequent crash is within the guest.

Ian.

> root@memoryana:/var/log/xen# cat xl-testmig.log
> Waiting for domain testmig (domid 2) to die [pid 13349]
> Domain 2 is dead
> Done. Exiting now
Has any progress been made regarding this issue?

> On Sat, 2011-10-15 at 11:35 +0100, Andreas Olsowski wrote:
>> On 10/15/2011 07:44 AM, Ian Campbell wrote:
>>> On Sat, 2011-10-15 at 02:18 +0100, Andreas Olsowski wrote:
>>>> It seems this still has not made it into 4.1-testing.
>>> I'm afraid I've not had time to "figure out how to automatically select
>>> which guests are capable of a cooperative resume and which are not." so
>>> it hasn't been fixed in xen-unstable either AFAIK.
>>>
>> Wouldn't just assuming all of them do fix a bigger percentage of systems
>> than leaving it the way it is?
> I don't know -- that's really the point.
>
>>> I'm also still interested in confirmation to the question I asked in the
>>> mail you just replied to.
>>>
>> Oh, I thought I already did.
>>
>>> Are you saying that you don't see the "failed for domain %u" message
>>> immediately after the xc_domain_resume call?
>>>
>>> +    if (xc_domain_resume(ctx->xch, domid, 1)) {
>>>          LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
>>>                           "xc_domain_resume failed for domain %u",
>>>                           domid);
>>>
>>> That would be pretty odd...
>>
>> Yes, that is what I am saying:
> Oh wait, that's right -- as far as the toolstack is concerned the resume
> was successful; the subsequent crash is within the guest.
>
> Ian.

Andreas