Hello,
I've known for a while that my machines crash when migrating VMs from
one host to another. Lately I've started looking into the issue and
noticed that it happens when Xen is trying to save the VM, i.e. also
when using "xm save" (on a PV guest). I got access to the console and
noticed that when it happens, the hypervisor logs a fatal page fault
and the machine reboots:
(Note that this has happened with various 4.1.x Xen hypervisors and
tools; I upgraded to the latest version recently just to rule out that
this is a bug that has already been fixed.)
I have googled for the issue, but it seems I am (again) the only person
running into this kind of trouble...
(XEN) ----[ Xen-4.1.3-rc2-pre x86_64 debug=n Not tainted ]----
(XEN) CPU: 8
(XEN) RIP: e008:[<ffff82c48012e92c>] do_tmem_op+0x116c/0x1630
(XEN) RFLAGS: 0000000000010282 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: ffff830424383c30 rcx: 0000000000000000
(XEN) rdx: ffff83101bab4620 rsi: 0000000000000000 rdi: 0000000000000001
(XEN) rbp: 000000854e7fa7b0 rsp: ffff8304247b7e08 r8: 0000000000000000
(XEN) r9: 0000000000000010 r10: ffff82c48020b3a0 r11: 0000000000000286
(XEN) r12: ffff83101bab5c30 r13: 0000000000000008 r14: 00000000ffffffff
(XEN) r15: 0000000000000001 cr0: 0000000080050033 cr4: 00000000000006f0
(XEN) cr3: 000000081d7de000 cr2: 0000000000000000
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff8304247b7e08:
(XEN) ffff830424454000 0000000000000023 0000000000000000 00000000103ea640
(XEN) 0000000000000000 0000001024454000 0000000000000000 0000000100000012
(XEN) 0000000000000010 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 ffff82c480180cb7 00007f0b277d0000 ffff880117240088
(XEN) ffff8300d7ada000 ffff8304247b7f18 0000000000000001 ffff82c4801fe396
(XEN) 0000000000000007 0000000000000246 0000000000000000 0000000000000000
(XEN) 00007f0b27c10ff9 000000000000e033 0000000000010203 ffff8300d7ada000
(XEN) ffff88011c7ede98 00000000ffffffe7 ffff880118003540 0000000000000003
(XEN) 00007fff97d5cb70 ffff82c4801f9ad8 00007fff97d5cb70 0000000000000003
(XEN) ffff880118003540 00000000ffffffe7 ffff88011c7ede98 00007fff97d5cb70
(XEN) 0000000000000286 00007f0b27bfc438 0000000000000010 00007f0b28038568
(XEN) 0000000000000026 ffffffff810014ca 00007f0b2802d358 0000000000000000
(XEN) 0000000001155004 0000010000000000 ffffffff810014ca 000000000000e033
(XEN) 0000000000000286 ffff88011c7ede40 000000000000e02b 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000008
(XEN) ffff8300d7ada000 0000003fa44ff880 0000000000000000
(XEN) Xen call trace:
(XEN) [<ffff82c48012e92c>] do_tmem_op+0x116c/0x1630
(XEN) [<ffff82c480180cb7>] copy_from_user+0x27/0x90
(XEN) [<ffff82c4801fe396>] do_iret+0xb6/0x1a0
(XEN) [<ffff82c4801f9ad8>] syscall_enter+0x88/0x8d
(XEN)
(XEN) Pagetable walk from 0000000000000000:
(XEN) L4[0x000] = 000000101b8d7067 0000000000119c69
(XEN) L3[0x000] = 0000000420d48067 000000000011d277
(XEN) L2[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 8:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: 0000000000000000
(XEN) ****************************************
(XEN)
I've nailed the crash in do_tmem_op() (xen/common/tmem.c) down to this
line:
    case TMEMC_SAVE_GET_POOL_UUID:
         if ( pool == NULL )
             break;
         uuid = (uint64_t *)buf.p;
     --> *uuid++ = pool->uuid[0];
         *uuid = pool->uuid[1];
         rc = 0;
Apparently buf.p (%rcx) is NULL.
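As a band-aid I could imagine simply guarding the dereference, roughly
like this (just a sketch on my side, and obviously not a real fix,
since it doesn't explain why buf.p is NULL in the first place):

    case TMEMC_SAVE_GET_POOL_UUID:
         if ( pool == NULL || buf.p == NULL )  /* also bail out on a NULL buffer */
             break;
         uuid = (uint64_t *)buf.p;
         *uuid++ = pool->uuid[0];
         *uuid = pool->uuid[1];
         rc = 0;

That would at least keep the host from going down during "xm save",
but the interesting question remains how the buffer handle ends up
NULL here.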
I've been trying to figure out how this could happen, but I don't know
enough about the mechanisms involved. Apparently this op gets called
from Dom0 userspace (xend, via tools/libxc/xc_tmem.c):
(void)xc_tmem_control(xch,i,TMEMC_SAVE_GET_POOL_UUID,dom,sizeof(uuid),0,0,&uuid);
which in turn does:
xen_set_guest_handle(op.u.ctrl.buf, buf);
rc = do_tmem_op(xch,&op);
There are also calls to some bounce-buffer handling when the subop is
TMEMC_LIST, but not in the other cases (I have no clue what this is
about).
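In case it helps to narrow down on which side the NULL appears, I was
thinking of a quick sanity check in the tools right before the
hypercall; something along these lines (just a sketch; I'm assuming buf
and subop are the parameter names in xc_tmem_control and that stdio is
available there):

    /* debug only: confirm the user buffer is sane before the hypercall */
    if ( buf == NULL )
        fprintf(stderr, "xc_tmem_control: subop %u called with NULL buf\n",
                (unsigned int)subop);

If buf is already NULL at that point, the problem is on the tools side;
if not, something must be losing the handle on the way into the
hypervisor.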
(Note: the Dom0 kernel is 3.1, but the same has been happening with
older kernels too, and I don't think the kernel is involved at all
here.)
Has anyone else seen this issue? What am I doing wrong? Does my
analysis maybe help a bit in figuring out what is going on?
Thanks,
Jana