Florian Haas
2018-Feb-06 15:11 UTC
[libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
Hi everyone,

I hope this is the correct list to discuss this issue; please feel free to redirect me otherwise.

I have a nested virtualization setup that looks as follows:

- Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
- L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
- Nested guest: SLES 12, kernel 3.12.28-4-default

The nested guest is configured with "<type arch='x86_64' machine='pc-i440fx-1.4'>hvm</type>".

This is working just beautifully, except when the L0 guest wakes up from managed save (openstack server resume in OpenStack parlance). Then, in the L0 guest we immediately see this:

[Tue Feb 6 07:00:37 2018] ------------[ cut here ]------------
[Tue Feb 6 07:00:37 2018] kernel BUG at ../arch/x86/kvm/x86.c:328!
[Tue Feb 6 07:00:37 2018] invalid opcode: 0000 [#1] SMP
[Tue Feb 6 07:00:37 2018] Modules linked in: fuse vhost_net vhost macvtap macvlan xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun br_netfilter bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables vboxpci(O) vboxnetadp(O) vboxnetflt(O) af_packet iscsi_ibft iscsi_boot_sysfs vboxdrv(O) kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel hid_generic usbhid jitterentropy_rng drbg ansi_cprng ppdev parport_pc floppy parport joydev aesni_intel processor button aes_x86_64 virtio_balloon virtio_net lrw gf128mul glue_helper pcspkr serio_raw ablk_helper cryptd i2c_piix4 ext4 crc16 jbd2 mbcache ata_generic
[Tue Feb 6 07:00:37 2018] virtio_blk ata_piix ahci libahci cirrus(O) drm_kms_helper(O) syscopyarea sysfillrect sysimgblt fb_sys_fops ttm(O) drm(O) virtio_pci virtio_ring virtio uhci_hcd ehci_hcd usbcore usb_common libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
[Tue Feb 6 07:00:37 2018] CPU: 2 PID: 2041 Comm: CPU 0/KVM Tainted: G W O 4.4.104-39-default #1
[Tue Feb 6 07:00:37 2018] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[Tue Feb 6 07:00:37 2018] task: ffff880037108d80 ti: ffff88042e964000 task.ti: ffff88042e964000
[Tue Feb 6 07:00:37 2018] RIP: 0010:[<ffffffffa04f20e5>]  [<ffffffffa04f20e5>] kvm_spurious_fault+0x5/0x10 [kvm]
[Tue Feb 6 07:00:37 2018] RSP: 0018:ffff88042e967d70  EFLAGS: 00010246
[Tue Feb 6 07:00:37 2018] RAX: 0000000000000000 RBX: ffff88042c4f0040 RCX: 0000000000000000
[Tue Feb 6 07:00:37 2018] RDX: 0000000000006820 RSI: 0000000000000282 RDI: ffff88042c4f0040
[Tue Feb 6 07:00:37 2018] RBP: ffff88042c4f00d8 R08: ffff88042e964000 R09: 0000000000000002
[Tue Feb 6 07:00:37 2018] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000001
[Tue Feb 6 07:00:37 2018] R13: 0000021d34fbb21d R14: 0000000000000001 R15: 000055d2157cf840
[Tue Feb 6 07:00:37 2018] FS:  00007f7c52b96700(0000) GS:ffff88043fd00000(0000) knlGS:0000000000000000
[Tue Feb 6 07:00:37 2018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Feb 6 07:00:37 2018] CR2: 00007f823b15f000 CR3: 0000000429334000 CR4: 0000000000362670
[Tue Feb 6 07:00:37 2018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Tue Feb 6 07:00:37 2018] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Tue Feb 6 07:00:37 2018] Stack:
[Tue Feb 6 07:00:37 2018]  ffffffffa07939b1 ffffffffa0787875 ffffffffa0503a60 ffff88042c4f0040
[Tue Feb 6 07:00:37 2018]  ffffffffa04e5ede ffff88042c4f0040 ffffffffa04e6f0f ffff880037108d80
[Tue Feb 6 07:00:37 2018]  ffff88042c4f00e0 ffff88042c4f00e0 ffff88042c4f0040 ffff88042e968000
[Tue Feb 6 07:00:37 2018] Call Trace:
[Tue Feb 6 07:00:37 2018]  [<ffffffffa07939b1>] intel_pmu_set_msr+0xfc1/0x2341 [kvm_intel]
[Tue Feb 6 07:00:37 2018] DWARF2 unwinder stuck at intel_pmu_set_msr+0xfc1/0x2341 [kvm_intel]
[Tue Feb 6 07:00:37 2018] Leftover inexact backtrace:
[Tue Feb 6 07:00:37 2018]  [<ffffffffa0787875>] ? vmx_interrupt_allowed+0x15/0x30 [kvm_intel]
[Tue Feb 6 07:00:37 2018]  [<ffffffffa0503a60>] ? kvm_arch_vcpu_runnable+0xa0/0xd0 [kvm]
[Tue Feb 6 07:00:37 2018]  [<ffffffffa04e5ede>] ? kvm_vcpu_check_block+0xe/0x60 [kvm]
[Tue Feb 6 07:00:37 2018]  [<ffffffffa04e6f0f>] ? kvm_vcpu_block+0x8f/0x310 [kvm]
[Tue Feb 6 07:00:37 2018]  [<ffffffffa0503c17>] ? kvm_arch_vcpu_ioctl_run+0x187/0x400 [kvm]
[Tue Feb 6 07:00:37 2018]  [<ffffffffa04ea6d9>] ? kvm_vcpu_ioctl+0x359/0x680 [kvm]
[Tue Feb 6 07:00:37 2018]  [<ffffffff81016689>] ? __switch_to+0x1c9/0x460
[Tue Feb 6 07:00:37 2018]  [<ffffffff81224f02>] ? do_vfs_ioctl+0x322/0x5d0
[Tue Feb 6 07:00:37 2018]  [<ffffffff811362ef>] ? __audit_syscall_entry+0xaf/0x100
[Tue Feb 6 07:00:37 2018]  [<ffffffff8100383b>] ? syscall_trace_enter_phase1+0x15b/0x170
[Tue Feb 6 07:00:37 2018]  [<ffffffff81225224>] ? SyS_ioctl+0x74/0x80
[Tue Feb 6 07:00:37 2018]  [<ffffffff81634a02>] ? entry_SYSCALL_64_fastpath+0x16/0xae
[Tue Feb 6 07:00:37 2018] Code: d7 fe ff ff 8b 2d 04 6e 06 00 e9 c2 fe ff ff 48 89 f2 e9 65 ff ff ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 89 ff 48 89
[Tue Feb 6 07:00:37 2018] RIP  [<ffffffffa04f20e5>] kvm_spurious_fault+0x5/0x10 [kvm]
[Tue Feb 6 07:00:37 2018]  RSP <ffff88042e967d70>
[Tue Feb 6 07:00:37 2018] ---[ end trace e15c567f77920049 ]---

We only hit this kernel bug if we have a nested VM running. The exact same setup, sent into managed save after shutting down the nested VM, wakes up just fine.

Now I am aware of https://bugzilla.redhat.com/show_bug.cgi?id=1076294, which talks about live migration, but I think the same considerations apply.

I am also aware of https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM, which strongly suggests using host-passthrough or host-model. I have tried both, to no avail; the stack trace persists.

I have also tried running a 4.15 kernel in the L0 guest, from https://kernel.opensuse.org/packages/stable, but again, the stack trace persists.

What does fix things, of course, is to switch the nested guest from KVM to Qemu, but that also makes things significantly slower.

So I'm wondering: is there someone reading this who does run nested KVM and has managed to successfully live-migrate or managed-save? If so, would you be able to share a working host kernel / L0 guest kernel / nested guest kernel combination, or any other hints for tuning the L0 guest to support managed save and live migration?

I'd be extraordinarily grateful for any suggestions. Thanks!

Cheers,
Florian
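For reference, a minimal sketch of the checks this kind of setup implies, assuming an Intel CPU; the domain name "l0-guest" is only a placeholder, not one used in this thread:

    # On the Ubuntu 16.04 host: is nested VMX enabled for kvm_intel?
    $ cat /sys/module/kvm_intel/parameters/nested     # expect "Y" (or "1")

    # Inside the L0 guest (the guest hypervisor): is the vmx flag actually
    # exposed to it? Nested KVM cannot work without it.
    $ grep -c vmx /proc/cpuinfo                       # must be non-zero

    # On the host: which CPU mode does libvirt hand to the L0 guest?
    # (host-passthrough corresponds to "-cpu host" on the QEMU command line)
    $ virsh dumpxml l0-guest | grep -A2 '<cpu'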
Kashyap Chamarthy
2018-Feb-07 15:31 UTC
Re: [libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
[Cc: KVM upstream list.]

On Tue, Feb 06, 2018 at 04:11:46PM +0100, Florian Haas wrote:
> Hi everyone,
>
> I hope this is the correct list to discuss this issue; please feel
> free to redirect me otherwise.
>
> I have a nested virtualization setup that looks as follows:
>
> - Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
> - L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
> - Nested guest: SLES 12, kernel 3.12.28-4-default
>
> The nested guest is configured with "<type arch='x86_64'
> machine='pc-i440fx-1.4'>hvm</type>".
>
> This is working just beautifully, except when the L0 guest wakes up
> from managed save (openstack server resume in OpenStack parlance).
> Then, in the L0 guest we immediately see this:

[...] # Snip the call trace from Florian. It is here:
https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html

> What does fix things, of course, is to switch from the nested guest
> from KVM to Qemu — but that also makes things significantly slower.
>
> So I'm wondering: is there someone reading this who does run nested
> KVM and has managed to successfully live-migrate or managed-save? If
> so, would you be able to share a working host kernel / L0 guest kernel
> / nested guest kernel combination, or any other hints for tuning the
> L0 guest to support managed save and live migration?

Following up from our IRC discussion (on #kvm, Freenode). Re-posting my
comment here:

So I just did a test of 'managedsave' (which is just "save the state of
the running VM to a file" in libvirt parlance) of L1, _while_ L2 is
running, and I seem to reproduce your case (see the call trace
attached).

# Ensure L2 (the nested guest) is running on L1. Then, from L0, do
# the following:
[L0] $ virsh managedsave L1
[L0] $ virsh start L1 --console

Result: See the call trace attached to this bug. But L1 goes on to
start "fine", and L2 keeps running, too. But things start to seem
weird. As in: I try to safely, read-only mount the L2 disk image via
libguestfs (by setting export LIBGUESTFS_BACKEND=direct, which uses
direct QEMU): `guestfish --ro -a -i ./cirros.qcow2`. It throws the call
trace again on the L1 serial console. And the `guestfish` command just
sits there forever.

- L0 (bare metal) kernel: 4.13.13-300.fc27.x86_64+debug
- L1 (guest hypervisor) kernel: 4.11.10-300.fc26.x86_64
- L2 is a CirrOS 3.5 image

I can reproduce this at least 3 times, with the above versions.

I'm using libvirt 'host-passthrough' for CPU (meaning: '-cpu host' in
QEMU parlance) for both L1 and L2.

My L0 CPU is: Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz.

Thoughts?

---

[/me wonders if I'll be asked to reproduce this with newest upstream
kernels.]

[...]

--
/kashyap

[Attachment: L1-call-trace-on-start-from-managed-save.txt]

$> virsh start f26-devstack --console
Domain f26-devstack started
Connected to domain f26-devstack
Escape character is ^]
[ 1323.605321] ------------[ cut here ]------------
[ 1323.608653] kernel BUG at arch/x86/kvm/x86.c:336!
[ 1323.611661] invalid opcode: 0000 [#1] SMP
[ 1323.614221] Modules linked in: vhost_net vhost tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables sb_edac edac_core kvm_intel openvswitch nf_conntrack_ipv6 kvm nf_nat_ipv6 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack irqbypass crct10dif_pclmul sunrpc crc32_pclmul ppdev ghash_clmulni_intel parport_pc joydev virtio_net virtio_balloon parport tpm_tis i2c_piix4 tpm_tis_core tpm xfs libcrc32c virtio_blk virtio_console virtio_rng crc32c_intel serio_raw virtio_pci ata_generic virtio_ring virtio pata_acpi qemu_fw_cfg
[ 1323.645674] CPU: 0 PID: 18587 Comm: CPU 0/KVM Not tainted 4.11.10-300.fc26.x86_64 #1
[ 1323.649592] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-1.fc27 04/01/2014
[ 1323.653935] task: ffff8b5be13ca580 task.stack: ffffa8b78147c000
[ 1323.656783] RIP: 0010:kvm_spurious_fault+0x9/0x10 [kvm]
[ 1323.659317] RSP: 0018:ffffa8b78147fc78 EFLAGS: 00010246
[ 1323.661808] RAX: 0000000000000000 RBX: ffff8b5be13c0000 RCX: 0000000000000000
[ 1323.665077] RDX: 0000000000006820 RSI: 0000000000000292 RDI: ffff8b5be13c0000
[ 1323.668287] RBP: ffffa8b78147fc78 R08: ffff8b5be13c0090 R09: 0000000000000000
[ 1323.671515] R10: ffffa8b78147fbf8 R11: 0000000000000000 R12: ffff8b5be13c0088
[ 1323.674598] R13: 0000000000000001 R14: 00000131e2372ee6 R15: ffff8b5be1360040
[ 1323.677643] FS:  00007fd602aff700(0000) GS:ffff8b5bffc00000(0000) knlGS:0000000000000000
[ 1323.681130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1323.683628] CR2: 000055d650532c20 CR3: 0000000221260000 CR4: 00000000001426f0
[ 1323.686697] Call Trace:
[ 1323.687817]  intel_pmu_get_msr+0xd23/0x3f44 [kvm_intel]
[ 1323.690151]  ? vmx_interrupt_allowed+0x19/0x40 [kvm_intel]
[ 1323.692583]  kvm_arch_vcpu_runnable+0xa5/0xe0 [kvm]
[ 1323.694767]  kvm_vcpu_check_block+0x12/0x50 [kvm]
[ 1323.696858]  kvm_vcpu_block+0xa3/0x2f0 [kvm]
[ 1323.698762]  kvm_arch_vcpu_ioctl_run+0x165/0x16a0 [kvm]
[ 1323.701079]  ? kvm_arch_vcpu_load+0x6d/0x290 [kvm]
[ 1323.703175]  ? __check_object_size+0xbb/0x1b3
[ 1323.705109]  kvm_vcpu_ioctl+0x2a6/0x620 [kvm]
[ 1323.707021]  ? kvm_vcpu_ioctl+0x2a6/0x620 [kvm]
[ 1323.709006]  do_vfs_ioctl+0xa5/0x600
[ 1323.710570]  SyS_ioctl+0x79/0x90
[ 1323.712011]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1323.714033] RIP: 0033:0x7fd610fb35e7
[ 1323.715601] RSP: 002b:00007fd602afe7c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1323.718869] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fd610fb35e7
[ 1323.721972] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000013
[ 1323.725044] RBP: 0000563dab190300 R08: 0000563dab1ab7d0 R09: 01fc2de3f821e99c
[ 1323.728124] R10: 000000003b9aca00 R11: 0000000000000246 R12: 0000563dadce20a6
[ 1323.731195] R13: 0000000000000000 R14: 00007fd61a84c000 R15: 0000563dadce2000
[ 1323.734268] Code: 8d 00 00 01 c7 05 1c e6 05 00 01 00 00 00 41 bd 01 00 00 00 44 8b 25 2f e6 05 00 e9 db fe ff ff 66 90 0f 1f 44 00 00 55 48 89 e5 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 55 89 ff 48 89 e5 41 54 53
[ 1323.742385] RIP: kvm_spurious_fault+0x9/0x10 [kvm] RSP: ffffa8b78147fc78
[ 1323.745438] ---[ end trace 92fa23c974db8b7e ]---
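The reproduction steps above, collected in one place; "L1" is simply the libvirt domain name of the guest hypervisor in this test, and the CirrOS image path will differ in other setups:

    # From L0, with L2 already running inside L1:
    $ virsh managedsave L1          # save L1's state to a file and stop it
    $ virsh start L1 --console      # resume L1; the call trace appears on its console

    # The libguestfs check that then hangs: read-only inspection of the L2
    # disk image, using the direct QEMU backend
    $ export LIBGUESTFS_BACKEND=direct
    $ guestfish --ro -a ./cirros.qcow2 -i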
David Hildenbrand
2018-Feb-07 22:26 UTC
Re: [libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
On 07.02.2018 16:31, Kashyap Chamarthy wrote:
> [Cc: KVM upstream list.]
>
> On Tue, Feb 06, 2018 at 04:11:46PM +0100, Florian Haas wrote:
>> Hi everyone,
>>
>> I hope this is the correct list to discuss this issue; please feel
>> free to redirect me otherwise.
>>
>> I have a nested virtualization setup that looks as follows:
>>
>> - Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
>> - L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
>> - Nested guest: SLES 12, kernel 3.12.28-4-default
>>
>> The nested guest is configured with "<type arch='x86_64'
>> machine='pc-i440fx-1.4'>hvm</type>".
>>
>> This is working just beautifully, except when the L0 guest wakes up
>> from managed save (openstack server resume in OpenStack parlance).
>> Then, in the L0 guest we immediately see this:
>
> [...] # Snip the call trace from Florian. It is here:
> https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html
>
>> What does fix things, of course, is to switch from the nested guest
>> from KVM to Qemu — but that also makes things significantly slower.
>>
>> So I'm wondering: is there someone reading this who does run nested
>> KVM and has managed to successfully live-migrate or managed-save? If
>> so, would you be able to share a working host kernel / L0 guest kernel
>> / nested guest kernel combination, or any other hints for tuning the
>> L0 guest to support managed save and live migration?
>
> Following up from our IRC discussion (on #kvm, Freenode). Re-posting my
> comment here:
>
> So I just did a test of 'managedsave' (which is just "save the state of
> the running VM to a file" in libvirt parlance) of L1, _while_ L2 is
> running, and I seem to reproduce your case (see the call trace
> attached).
>
> # Ensure L2 (the nested guest) is running on L1. Then, from L0, do
> # the following:
> [L0] $ virsh managedsave L1
> [L0] $ virsh start L1 --console
>
> Result: See the call trace attached to this bug. But L1 goes on to
> start "fine", and L2 keeps running, too. But things start to seem
> weird. As in: I try to safely, read-only mount the L2 disk image via
> libguestfs (by setting export LIBGUESTFS_BACKEND=direct, which uses
> direct QEMU): `guestfish --ro -a -i ./cirros.qcow2`. It throws the call
> trace again on the L1 serial console. And the `guestfish` command just
> sits there forever.
>
> - L0 (bare metal) kernel: 4.13.13-300.fc27.x86_64+debug
> - L1 (guest hypervisor) kernel: 4.11.10-300.fc26.x86_64
> - L2 is a CirrOS 3.5 image
>
> I can reproduce this at least 3 times, with the above versions.
>
> I'm using libvirt 'host-passthrough' for CPU (meaning: '-cpu host' in
> QEMU parlance) for both L1 and L2.
>
> My L0 CPU is: Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz.
>
> Thoughts?

Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621

In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.

--
Thanks,

David / dhildenb
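Until that support exists, the practical options are the one Florian already found (run the nested guest under plain Qemu/TCG instead of KVM, at a performance cost) or, where nested KVM can be given up on that host entirely, disabling nesting on the bare-metal machine so that L1 never sees VMX in the first place. A sketch of the latter, which is not something proposed in the thread itself, assuming the Intel kvm_intel module and that all KVM guests on the host are shut down first:

    # On the L0 bare-metal host (workaround sketch, not from the thread):
    $ echo "options kvm_intel nested=0" | sudo tee /etc/modprobe.d/kvm-nested.conf
    $ sudo modprobe -r kvm_intel && sudo modprobe kvm_intel
    $ cat /sys/module/kvm_intel/parameters/nested     # should now report "N" (or "0")

The lighter-weight variant, as Florian notes above, is simply to shut the nested guest down before managed-saving or migrating the guest hypervisor.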