Florian Haas
2018-Feb-06 15:11 UTC
[libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
Hi everyone,

I hope this is the correct list to discuss this issue; please feel free to redirect me otherwise.

I have a nested virtualization setup that looks as follows:

- Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
- L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
- Nested guest: SLES 12, kernel 3.12.28-4-default

The nested guest is configured with "<type arch='x86_64' machine='pc-i440fx-1.4'>hvm</type>".

This is working just beautifully, except when the L0 guest wakes up from managed save ("openstack server resume" in OpenStack parlance). Then, in the L0 guest we immediately see this:

[Tue Feb 6 07:00:37 2018] ------------[ cut here ]------------
[Tue Feb 6 07:00:37 2018] kernel BUG at ../arch/x86/kvm/x86.c:328!
[Tue Feb 6 07:00:37 2018] invalid opcode: 0000 [#1] SMP
[Tue Feb 6 07:00:37 2018] Modules linked in: fuse vhost_net vhost macvtap macvlan xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun br_netfilter bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables vboxpci(O) vboxnetadp(O) vboxnetflt(O) af_packet iscsi_ibft iscsi_boot_sysfs vboxdrv(O) kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel hid_generic usbhid jitterentropy_rng drbg ansi_cprng ppdev parport_pc floppy parport joydev aesni_intel processor button aes_x86_64 virtio_balloon virtio_net lrw gf128mul glue_helper pcspkr serio_raw ablk_helper cryptd i2c_piix4 ext4 crc16 jbd2 mbcache ata_generic
[Tue Feb 6 07:00:37 2018]  virtio_blk ata_piix ahci libahci cirrus(O) drm_kms_helper(O) syscopyarea sysfillrect sysimgblt fb_sys_fops ttm(O) drm(O) virtio_pci virtio_ring virtio uhci_hcd ehci_hcd usbcore usb_common libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
[Tue Feb 6 07:00:37 2018] CPU: 2 PID: 2041 Comm: CPU 0/KVM Tainted: G W O 4.4.104-39-default #1
[Tue Feb 6 07:00:37 2018] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[Tue Feb 6 07:00:37 2018] task: ffff880037108d80 ti: ffff88042e964000 task.ti: ffff88042e964000
[Tue Feb 6 07:00:37 2018] RIP: 0010:[<ffffffffa04f20e5>]  [<ffffffffa04f20e5>] kvm_spurious_fault+0x5/0x10 [kvm]
[Tue Feb 6 07:00:37 2018] RSP: 0018:ffff88042e967d70  EFLAGS: 00010246
[Tue Feb 6 07:00:37 2018] RAX: 0000000000000000 RBX: ffff88042c4f0040 RCX: 0000000000000000
[Tue Feb 6 07:00:37 2018] RDX: 0000000000006820 RSI: 0000000000000282 RDI: ffff88042c4f0040
[Tue Feb 6 07:00:37 2018] RBP: ffff88042c4f00d8 R08: ffff88042e964000 R09: 0000000000000002
[Tue Feb 6 07:00:37 2018] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000001
[Tue Feb 6 07:00:37 2018] R13: 0000021d34fbb21d R14: 0000000000000001 R15: 000055d2157cf840
[Tue Feb 6 07:00:37 2018] FS:  00007f7c52b96700(0000) GS:ffff88043fd00000(0000) knlGS:0000000000000000
[Tue Feb 6 07:00:37 2018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Feb 6 07:00:37 2018] CR2: 00007f823b15f000 CR3: 0000000429334000 CR4: 0000000000362670
[Tue Feb 6 07:00:37 2018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Tue Feb 6 07:00:37 2018] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Tue Feb 6 07:00:37 2018] Stack:
[Tue Feb 6 07:00:37 2018]  ffffffffa07939b1 ffffffffa0787875 ffffffffa0503a60 ffff88042c4f0040
[Tue Feb 6 07:00:37 2018]  ffffffffa04e5ede ffff88042c4f0040 ffffffffa04e6f0f ffff880037108d80
[Tue Feb 6 07:00:37 2018]  ffff88042c4f00e0 ffff88042c4f00e0 ffff88042c4f0040 ffff88042e968000
[Tue Feb 6 07:00:37 2018] Call Trace:
[Tue Feb 6 07:00:37 2018]  [<ffffffffa07939b1>] intel_pmu_set_msr+0xfc1/0x2341 [kvm_intel]
[Tue Feb 6 07:00:37 2018] DWARF2 unwinder stuck at intel_pmu_set_msr+0xfc1/0x2341 [kvm_intel]
[Tue Feb 6 07:00:37 2018] Leftover inexact backtrace:
[Tue Feb 6 07:00:37 2018]  [<ffffffffa0787875>] ? vmx_interrupt_allowed+0x15/0x30 [kvm_intel]
[Tue Feb 6 07:00:37 2018]  [<ffffffffa0503a60>] ? kvm_arch_vcpu_runnable+0xa0/0xd0 [kvm]
[Tue Feb 6 07:00:37 2018]  [<ffffffffa04e5ede>] ? kvm_vcpu_check_block+0xe/0x60 [kvm]
[Tue Feb 6 07:00:37 2018]  [<ffffffffa04e6f0f>] ? kvm_vcpu_block+0x8f/0x310 [kvm]
[Tue Feb 6 07:00:37 2018]  [<ffffffffa0503c17>] ? kvm_arch_vcpu_ioctl_run+0x187/0x400 [kvm]
[Tue Feb 6 07:00:37 2018]  [<ffffffffa04ea6d9>] ? kvm_vcpu_ioctl+0x359/0x680 [kvm]
[Tue Feb 6 07:00:37 2018]  [<ffffffff81016689>] ? __switch_to+0x1c9/0x460
[Tue Feb 6 07:00:37 2018]  [<ffffffff81224f02>] ? do_vfs_ioctl+0x322/0x5d0
[Tue Feb 6 07:00:37 2018]  [<ffffffff811362ef>] ? __audit_syscall_entry+0xaf/0x100
[Tue Feb 6 07:00:37 2018]  [<ffffffff8100383b>] ? syscall_trace_enter_phase1+0x15b/0x170
[Tue Feb 6 07:00:37 2018]  [<ffffffff81225224>] ? SyS_ioctl+0x74/0x80
[Tue Feb 6 07:00:37 2018]  [<ffffffff81634a02>] ? entry_SYSCALL_64_fastpath+0x16/0xae
[Tue Feb 6 07:00:37 2018] Code: d7 fe ff ff 8b 2d 04 6e 06 00 e9 c2 fe ff ff 48 89 f2 e9 65 ff ff ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 89 ff 48 89
[Tue Feb 6 07:00:37 2018] RIP  [<ffffffffa04f20e5>] kvm_spurious_fault+0x5/0x10 [kvm]
[Tue Feb 6 07:00:37 2018]  RSP <ffff88042e967d70>
[Tue Feb 6 07:00:37 2018] ---[ end trace e15c567f77920049 ]---

We only hit this kernel bug if a nested VM is running. The exact same setup, sent into managed save after shutting down the nested VM, wakes up just fine.

Now I am aware of https://bugzilla.redhat.com/show_bug.cgi?id=1076294, which talks about live migration -- but I think the same considerations apply.

I am also aware of https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM, which strongly suggests using host-passthrough or host-model. I have tried both, to no avail; the stack trace persists.

I have also tried running a 4.15 kernel in the L0 guest, from https://kernel.opensuse.org/packages/stable, but again, the stack trace persists.

What does fix things, of course, is to switch the nested guest from KVM to QEMU -- but that also makes things significantly slower.

So I'm wondering: is there someone reading this who does run nested KVM and has managed to successfully live-migrate or managed-save? If so, would you be able to share a working host kernel / L0 guest kernel / nested guest kernel combination, or any other hints for tuning the L0 guest to support managed save and live migration?

I'd be extraordinarily grateful for any suggestions. Thanks!

Cheers,
Florian
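As a quick sanity check before attempting such a managed save, the following is a minimal sketch of how one might confirm that nested virtualization is actually in play at each layer. The domain name "nested-guest" is a placeholder, and the paths assume an Intel machine with the kvm_intel module loaded (as in this thread):

  # On the physical host: nested support must be enabled in kvm_intel,
  # otherwise the guest hypervisor cannot run KVM guests at all.
  $ cat /sys/module/kvm_intel/parameters/nested    # expect "Y" (or "1" on older kernels)

  # Inside the guest hypervisor (Florian's "L0 guest"): the vmx flag must be
  # visible, which is what host-passthrough or host-model CPU mode provides.
  $ grep -c vmx /proc/cpuinfo                      # non-zero means VMX is exposed

  # Inside the guest hypervisor: confirm the nested guest really uses KVM
  # acceleration rather than plain QEMU emulation.
  $ virsh dumpxml nested-guest | grep 'domain type'  # expect type='kvm'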
Kashyap Chamarthy
2018-Feb-07 15:31 UTC
Re: [libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
[Cc: KVM upstream list.]

On Tue, Feb 06, 2018 at 04:11:46PM +0100, Florian Haas wrote:
> Hi everyone,
>
> I hope this is the correct list to discuss this issue; please feel
> free to redirect me otherwise.
>
> I have a nested virtualization setup that looks as follows:
>
> - Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
> - L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
> - Nested guest: SLES 12, kernel 3.12.28-4-default
>
> The nested guest is configured with "<type arch='x86_64'
> machine='pc-i440fx-1.4'>hvm</type>".
>
> This is working just beautifully, except when the L0 guest wakes up
> from managed save (openstack server resume in OpenStack parlance).
> Then, in the L0 guest we immediately see this:

[...] # Snip the call trace from Florian. It is here:
https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html

> What does fix things, of course, is to switch the nested guest
> from KVM to QEMU -- but that also makes things significantly slower.
>
> So I'm wondering: is there someone reading this who does run nested
> KVM and has managed to successfully live-migrate or managed-save? If
> so, would you be able to share a working host kernel / L0 guest kernel
> / nested guest kernel combination, or any other hints for tuning the
> L0 guest to support managed save and live migration?

Following up from our IRC discussion (on #kvm, Freenode), re-posting my
comment here:

So I just did a test of 'managedsave' (which is just "save the state of
the running VM to a file" in libvirt parlance) of L1, _while_ L2 is
running, and I seem to reproduce your case (see the call trace
attached).

  # Ensure L2 (the nested guest) is running on L1. Then, from L0, do
  # the following:
  [L0] $ virsh managedsave L1
  [L0] $ virsh start L1 --console

Result: see the call trace attached to this message. L1 goes on to
start "fine", and L2 keeps running, too, but things start to seem
weird. As in: I try to safely mount the L2 disk image read-only via
libguestfs (by setting "export LIBGUESTFS_BACKEND=direct", which uses
QEMU directly): `guestfish --ro -a -i ./cirros.qcow2`. It throws the
call trace again on the L1 serial console, and the `guestfish` command
just sits there forever.

- L0 (bare metal) kernel: 4.13.13-300.fc27.x86_64+debug
- L1 (guest hypervisor) kernel: 4.11.10-300.fc26.x86_64
- L2 is a CirrOS 3.5 image

I can reproduce this at least 3 times, with the above versions.

I'm using libvirt 'host-passthrough' for CPU (meaning: '-cpu host' in
QEMU parlance) for both L1 and L2.

My L0 CPU is: Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz.

Thoughts?

---

[/me wonders if I'll be asked to reproduce this with newest upstream
kernels.]

[...]

--
/kashyap

Attachment: L1-call-trace-on-start-from-managed-save.txt

$> virsh start f26-devstack --console
Domain f26-devstack started
Connected to domain f26-devstack
Escape character is ^]
[ 1323.605321] ------------[ cut here ]------------
[ 1323.608653] kernel BUG at arch/x86/kvm/x86.c:336!
[ 1323.611661] invalid opcode: 0000 [#1] SMP
[ 1323.614221] Modules linked in: vhost_net vhost tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables sb_edac edac_core kvm_intel openvswitch nf_conntrack_ipv6 kvm nf_nat_ipv6 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack irqbypass crct10dif_pclmul sunrpc crc32_pclmul ppdev ghash_clmulni_intel parport_pc joydev virtio_net virtio_balloon parport tpm_tis i2c_piix4 tpm_tis_core tpm xfs libcrc32c virtio_blk virtio_console virtio_rng crc32c_intel serio_raw virtio_pci ata_generic virtio_ring virtio pata_acpi qemu_fw_cfg
[ 1323.645674] CPU: 0 PID: 18587 Comm: CPU 0/KVM Not tainted 4.11.10-300.fc26.x86_64 #1
[ 1323.649592] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-1.fc27 04/01/2014
[ 1323.653935] task: ffff8b5be13ca580 task.stack: ffffa8b78147c000
[ 1323.656783] RIP: 0010:kvm_spurious_fault+0x9/0x10 [kvm]
[ 1323.659317] RSP: 0018:ffffa8b78147fc78 EFLAGS: 00010246
[ 1323.661808] RAX: 0000000000000000 RBX: ffff8b5be13c0000 RCX: 0000000000000000
[ 1323.665077] RDX: 0000000000006820 RSI: 0000000000000292 RDI: ffff8b5be13c0000
[ 1323.668287] RBP: ffffa8b78147fc78 R08: ffff8b5be13c0090 R09: 0000000000000000
[ 1323.671515] R10: ffffa8b78147fbf8 R11: 0000000000000000 R12: ffff8b5be13c0088
[ 1323.674598] R13: 0000000000000001 R14: 00000131e2372ee6 R15: ffff8b5be1360040
[ 1323.677643] FS:  00007fd602aff700(0000) GS:ffff8b5bffc00000(0000) knlGS:0000000000000000
[ 1323.681130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1323.683628] CR2: 000055d650532c20 CR3: 0000000221260000 CR4: 00000000001426f0
[ 1323.686697] Call Trace:
[ 1323.687817]  intel_pmu_get_msr+0xd23/0x3f44 [kvm_intel]
[ 1323.690151]  ? vmx_interrupt_allowed+0x19/0x40 [kvm_intel]
[ 1323.692583]  kvm_arch_vcpu_runnable+0xa5/0xe0 [kvm]
[ 1323.694767]  kvm_vcpu_check_block+0x12/0x50 [kvm]
[ 1323.696858]  kvm_vcpu_block+0xa3/0x2f0 [kvm]
[ 1323.698762]  kvm_arch_vcpu_ioctl_run+0x165/0x16a0 [kvm]
[ 1323.701079]  ? kvm_arch_vcpu_load+0x6d/0x290 [kvm]
[ 1323.703175]  ? __check_object_size+0xbb/0x1b3
[ 1323.705109]  kvm_vcpu_ioctl+0x2a6/0x620 [kvm]
[ 1323.707021]  ? kvm_vcpu_ioctl+0x2a6/0x620 [kvm]
[ 1323.709006]  do_vfs_ioctl+0xa5/0x600
[ 1323.710570]  SyS_ioctl+0x79/0x90
[ 1323.712011]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1323.714033] RIP: 0033:0x7fd610fb35e7
[ 1323.715601] RSP: 002b:00007fd602afe7c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1323.718869] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fd610fb35e7
[ 1323.721972] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000013
[ 1323.725044] RBP: 0000563dab190300 R08: 0000563dab1ab7d0 R09: 01fc2de3f821e99c
[ 1323.728124] R10: 000000003b9aca00 R11: 0000000000000246 R12: 0000563dadce20a6
[ 1323.731195] R13: 0000000000000000 R14: 00007fd61a84c000 R15: 0000563dadce2000
[ 1323.734268] Code: 8d 00 00 01 c7 05 1c e6 05 00 01 00 00 00 41 bd 01 00 00 00 44 8b 25 2f e6 05 00 e9 db fe ff ff 66 90 0f 1f 44 00 00 55 48 89 e5 <0f> 0b 0f 1f 44 00 00 0f 1f 44 00 00 55 89 ff 48 89 e5 41 54 53
[ 1323.742385] RIP: kvm_spurious_fault+0x9/0x10 [kvm] RSP: ffffa8b78147fc78
[ 1323.745438] ---[ end trace 92fa23c974db8b7e ]---
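Condensing the reproduction steps Kashyap describes above into a single sketch (the domain name "f26-devstack" and the CirrOS image are taken from his mail; where each command is run, and the dmesg filter, are assumptions):

  # On L0 (bare metal): save and restore the guest hypervisor (L1)
  # while its nested guest (L2) is still running.
  [L0] $ virsh managedsave f26-devstack
  [L0] $ virsh start f26-devstack --console   # L1's console shows the kvm_spurious_fault BUG

  # Inside L1, after it has resumed: check the kernel log for the trace,
  # then try a read-only libguestfs inspection of the L2 image, which
  # re-triggers the trace and hangs in Kashyap's test.
  [L1] $ dmesg | grep -A 5 'kernel BUG at arch/x86/kvm'
  [L1] $ export LIBGUESTFS_BACKEND=direct     # use QEMU directly, bypassing libvirt
  [L1] $ guestfish --ro -a ./cirros.qcow2 -i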
David Hildenbrand
2018-Feb-07 22:26 UTC
Re: [libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
On 07.02.2018 16:31, Kashyap Chamarthy wrote:
> [Cc: KVM upstream list.]
>
> On Tue, Feb 06, 2018 at 04:11:46PM +0100, Florian Haas wrote:
>> Hi everyone,
>>
>> I hope this is the correct list to discuss this issue; please feel
>> free to redirect me otherwise.
>>
>> I have a nested virtualization setup that looks as follows:
>>
>> - Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
>> - L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
>> - Nested guest: SLES 12, kernel 3.12.28-4-default
>>
>> The nested guest is configured with "<type arch='x86_64'
>> machine='pc-i440fx-1.4'>hvm</type>".
>>
>> This is working just beautifully, except when the L0 guest wakes up
>> from managed save (openstack server resume in OpenStack parlance).
>> Then, in the L0 guest we immediately see this:
>
> [...] # Snip the call trace from Florian. It is here:
> https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html
>
>> What does fix things, of course, is to switch the nested guest
>> from KVM to QEMU -- but that also makes things significantly slower.
>>
>> So I'm wondering: is there someone reading this who does run nested
>> KVM and has managed to successfully live-migrate or managed-save? If
>> so, would you be able to share a working host kernel / L0 guest kernel
>> / nested guest kernel combination, or any other hints for tuning the
>> L0 guest to support managed save and live migration?
>
> Following up from our IRC discussion (on #kvm, Freenode), re-posting my
> comment here:
>
> So I just did a test of 'managedsave' (which is just "save the state of
> the running VM to a file" in libvirt parlance) of L1, _while_ L2 is
> running, and I seem to reproduce your case (see the call trace
> attached).
>
>   # Ensure L2 (the nested guest) is running on L1. Then, from L0, do
>   # the following:
>   [L0] $ virsh managedsave L1
>   [L0] $ virsh start L1 --console
>
> Result: see the call trace attached to this message. L1 goes on to
> start "fine", and L2 keeps running, too, but things start to seem
> weird. As in: I try to safely mount the L2 disk image read-only via
> libguestfs (by setting "export LIBGUESTFS_BACKEND=direct", which uses
> QEMU directly): `guestfish --ro -a -i ./cirros.qcow2`. It throws the
> call trace again on the L1 serial console, and the `guestfish` command
> just sits there forever.
>
> - L0 (bare metal) kernel: 4.13.13-300.fc27.x86_64+debug
> - L1 (guest hypervisor) kernel: 4.11.10-300.fc26.x86_64
> - L2 is a CirrOS 3.5 image
>
> I can reproduce this at least 3 times, with the above versions.
>
> I'm using libvirt 'host-passthrough' for CPU (meaning: '-cpu host' in
> QEMU parlance) for both L1 and L2.
>
> My L0 CPU is: Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz.
>
> Thoughts?

Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621

In short: there is no (live) migration support for nested VMX yet. So
as soon as your guest is using VMX itself ("nVMX"), this is not
expected to work.

--
Thanks,

David / dhildenb
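Given David's explanation, the practical workaround at this point appears to be the one Florian already observed: make sure no nested VMX state is live before saving the guest hypervisor. A sketch under those assumptions (the domain names "l1" and "l2" are placeholders; unloading kvm_intel inside L1 is an extra precaution suggested here, not something prescribed in the thread, and assumes no other KVM guests are running there):

  # Inside the guest hypervisor (Florian's "L0 guest", Kashyap's L1):
  # stop the nested guest and release the VMX state held by kvm_intel.
  [L1] $ virsh shutdown l2
  [L1] $ modprobe -r kvm_intel kvm      # optional precaution; only works with no KVM guests running

  # From the physical host: the managed save / resume cycle should then
  # avoid kvm_spurious_fault, matching Florian's observation that the
  # same setup wakes up fine once the nested VM is shut down.
  [L0] $ virsh managedsave l1
  [L0] $ virsh start l1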