David Hildenbrand
2018-Feb-08 12:07 UTC
Re: [libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
>> In short: there is no (live) migration support for nested VMX yet. So as
>> soon as your guest is using VMX itself ("nVMX"), this is not expected to
>> work.
>
> Hi David, thanks for getting back to us on this.

Hi Florian,

(somebody please correct me if I'm wrong)

> I see your point, except the issue Kashyap and I are describing does
> not occur with live migration, it occurs with savevm/loadvm (virsh
> managedsave/virsh start in libvirt terms, nova suspend/resume in
> OpenStack lingo). And it's not immediately self-evident that the
> limitations for the former also apply to the latter. Even for the live
> migration limitation, I've been unsuccessful at finding documentation
> that warns users to not attempt live migration when using nesting, and
> this discussion sounds like a good opportunity for me to help fix
> that.
>
> Just to give an example,
> https://www.redhat.com/en/blog/inception-how-usable-are-nested-kvm-guests
> from just last September talks explicitly about how "guests can be
> snapshot/resumed, migrated to other hypervisors and much more" in the
> opening paragraph, and then talks at length about nested guests —
> without ever pointing out that those very features aren't expected to
> work for them. :)

Well, it still is a kernel parameter "nested" that is disabled by
default. So things should be expected to be shaky. :) While running
nested guests usually works fine, migrating a nested hypervisor is the
problem.

Especially see e.g.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/nested_virt

"However, note that nested virtualization is not supported or
recommended in production user environments, and is primarily intended
for development and testing."

> So to clarify things, could you enumerate the currently known
> limitations when enabling nesting? I'd be happy to summarize those and
> add them to the linux-kvm.org FAQ so others are less likely to hit
> their head on this issue. In particular:

The general problem is that migration of an L1 will not work when it is
running L2, so when L1 is using VMX ("nVMX").

Migrating an L2 should work as before.

The problem is, in order for L1 to make use of VMX to run L2, we have to
run L2 in L0, simulating VMX -> nested VMX, a.k.a. nVMX. This requires
additional state information about L1 ("nVMX" state), which is not
properly migrated when migrating L1. Therefore, after migration, the CPU
state of L1 might be screwed up, resulting in L1 crashes.

In addition, certain VMX features might be missing on the target, which
also still has to be handled via the CPU model in the future.

L0 should hopefully not crash; I hope that you are not seeing that.

> - Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
> still accurate in that -cpu host (libvirt "host-passthrough") is the
> strongly recommended configuration for the L2 guest?
>
> - If so, are there any recommendations for how to configure the L1
> guest with regard to CPU model?

You have to indicate the VMX feature to your L1 ("nested hypervisor"),
that is usually automatically done by using the "host-passthrough" or
"host-model" value. If you're using a custom CPU model, you have to
enable it explicitly.
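For a custom model, that would be something along these lines in the L1
domain XML (just a minimal sketch; the model name is only an example):

  <cpu mode='custom' match='exact'>
    <model fallback='forbid'>Haswell-noTSX</model>
    <!-- require the VMX flag so L1 can itself act as a hypervisor -->
    <feature policy='require' name='vmx'/>
  </cpu>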
> - Is live migration with nested guests _always_ expected to break on
> all architectures, and if not, which are safe?

x86 VMX: running nested guests works, migrating nested hypervisors does
not work

x86 SVM: running nested guests works, migrating nested hypervisors does
not work (somebody correct me if I'm wrong)

s390x: running nested guests works, migrating nested hypervisors works

power: running nested guests works only via KVM-PR ("trap and emulate").
Migrating nested hypervisors therefore works. But we are not using
hardware virtualization for L1->L2. (my latest status)

arm: running nested guests is in the works (my latest status), migration
is therefore also not possible.

> - Idem, for savevm/loadvm?

savevm/loadvm is not expected to work correctly on an L1 if it is
running L2 guests. It should work on L2 however.

> - With regard to the problem that Kashyap and I (and Dennis, the
> kernel.org bugzilla reporter) are describing, is this expected to work
> any better on AMD CPUs? (All reports are on Intel)

No, remember that they are also still missing migration support of the
nested SVM state.

> - Do you expect nested virtualization functionality to be adversely
> affected by KPTI and/or other Meltdown/Spectre mitigation patches?

Not an expert on this. I think it should be affected in a similar way as
ordinary guests :)

> Kashyap, can you think of any other limitations that would benefit
> from improved documentation?

We should certainly document what I have summarized here properly at a
central place!

> Cheers,
> Florian

--
Thanks,

David / dhildenb
Florian Haas
2018-Feb-08 13:29 UTC
Re: [libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
Hi David, thanks for the added input! I'm taking the liberty to snip a
few paragraphs to trim this email down a bit.

On Thu, Feb 8, 2018 at 1:07 PM, David Hildenbrand <david@redhat.com> wrote:
>> Just to give an example,
>> https://www.redhat.com/en/blog/inception-how-usable-are-nested-kvm-guests
>> from just last September talks explicitly about how "guests can be
>> snapshot/resumed, migrated to other hypervisors and much more" in the
>> opening paragraph, and then talks at length about nested guests —
>> without ever pointing out that those very features aren't expected to
>> work for them. :)
>
> Well, it still is a kernel parameter "nested" that is disabled by
> default. So things should be expected to be shaky. :) While running
> nested guests usually works fine, migrating a nested hypervisor is the
> problem.
>
> Especially see e.g.
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/nested_virt
>
> "However, note that nested virtualization is not supported or
> recommended in production user environments, and is primarily intended
> for development and testing."

Sure, I do understand that Red Hat (or any other vendor) is taking no
support responsibility for this. At this point I'd just like to
contribute to a better understanding of what's expected to definitely
_not_ work, so that people don't bloody their noses on that. :)

>> So to clarify things, could you enumerate the currently known
>> limitations when enabling nesting? I'd be happy to summarize those and
>> add them to the linux-kvm.org FAQ so others are less likely to hit
>> their head on this issue. In particular:
>
> The general problem is that migration of an L1 will not work when it is
> running L2, so when L1 is using VMX ("nVMX").
>
> Migrating an L2 should work as before.
>
> The problem is, in order for L1 to make use of VMX to run L2, we have to
> run L2 in L0, simulating VMX -> nested VMX, a.k.a. nVMX. This requires
> additional state information about L1 ("nVMX" state), which is not
> properly migrated when migrating L1. Therefore, after migration, the CPU
> state of L1 might be screwed up, resulting in L1 crashes.
>
> In addition, certain VMX features might be missing on the target, which
> also still has to be handled via the CPU model in the future.

Thanks a bunch for the added detail. Now I got a primer today from
Kashyap on IRC on how savevm/loadvm is very similar to migration, but
I'm still struggling to wrap my head around it. What you say makes
perfect sense to me in that _migration_ might blow up in subtle ways,
but can you try to explain to me why the same considerations would
apply with savevm/loadvm?

> L0 should hopefully not crash; I hope that you are not seeing that.

No I am not; we're good there. :)

>> - Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
>> still accurate in that -cpu host (libvirt "host-passthrough") is the
>> strongly recommended configuration for the L2 guest?
>>
>> - If so, are there any recommendations for how to configure the L1
>> guest with regard to CPU model?
>
> You have to indicate the VMX feature to your L1 ("nested hypervisor"),
> that is usually automatically done by using the "host-passthrough" or
> "host-model" value. If you're using a custom CPU model, you have to
> enable it explicitly.

Roger. Without that we can't do nesting at all.
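(As a quick sanity check from inside L1, assuming an Intel host; on AMD
the flag would be "svm" instead:)

  # run inside the L1 guest
  grep -cE '(vmx|svm)' /proc/cpuinfo   # non-zero means the virt extension is visible to L1
  ls -l /dev/kvm                       # must exist for L1 to run KVM-accelerated L2 guests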
>> - Is live migration with nested guests _always_ expected to break on
>> all architectures, and if not, which are safe?
>
> x86 VMX: running nested guests works, migrating nested hypervisors does
> not work
>
> x86 SVM: running nested guests works, migrating nested hypervisors does
> not work (somebody correct me if I'm wrong)
>
> s390x: running nested guests works, migrating nested hypervisors works
>
> power: running nested guests works only via KVM-PR ("trap and emulate").
> Migrating nested hypervisors therefore works. But we are not using
> hardware virtualization for L1->L2. (my latest status)
>
> arm: running nested guests is in the works (my latest status), migration
> is therefore also not possible.

Great summary, thanks!

>> - Idem, for savevm/loadvm?
>
> savevm/loadvm is not expected to work correctly on an L1 if it is
> running L2 guests. It should work on L2 however.

Again, I'm somewhat struggling to understand this vs. live migration —
but it's entirely possible that I'm sorely lacking in my knowledge of
kernel and CPU internals.

>> - With regard to the problem that Kashyap and I (and Dennis, the
>> kernel.org bugzilla reporter) are describing, is this expected to work
>> any better on AMD CPUs? (All reports are on Intel)
>
> No, remember that they are also still missing migration support of the
> nested SVM state.

Understood, thanks.

>> - Do you expect nested virtualization functionality to be adversely
>> affected by KPTI and/or other Meltdown/Spectre mitigation patches?
>
> Not an expert on this. I think it should be affected in a similar way as
> ordinary guests :)

Fair enough. :)

>> Kashyap, can you think of any other limitations that would benefit
>> from improved documentation?
>
> We should certainly document what I have summarized here properly at a
> central place!

I tried getting registered on the linux-kvm.org wiki to do exactly
that, and ran into an SMTP/DNS configuration issue with the
verification email. Kashyap said he was going to poke the site admin
about that.

Now, here's a bit more information on my continued testing. As I
mentioned on IRC, one of the things that struck me as odd was that if
I ran into the issue previously described, the L1 guest would enter a
reboot loop if configured with kernel.panic_on_oops=1. In other words,
I would savevm the L1 guest (with a running L2), then loadvm it, and
then the L1 would stack-trace, reboot, and then keep doing that
indefinitely. I found that weird because on the second reboot, I would
expect the system to come up cleanly.
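(For reference, the cycle I'm describing boils down to roughly the
following, where "l1" is just a placeholder for my L1 domain name:)

  # on L0, while the L1 guest is itself running an L2 guest
  virsh managedsave l1   # libvirt's equivalent of savevm, i.e. "migration to file"
  virsh start l1         # restores the saved state; this is where L1 stack-traces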
I've now changed my L2 guest's CPU configuration so that libvirt (in
L1) starts the L2 guest with the following settings:

  <cpu>
    <model fallback='forbid'>Haswell-noTSX</model>
    <vendor>Intel</vendor>
    <feature policy='disable' name='vme'/>
    <feature policy='disable' name='ss'/>
    <feature policy='disable' name='f16c'/>
    <feature policy='disable' name='rdrand'/>
    <feature policy='disable' name='hypervisor'/>
    <feature policy='disable' name='arat'/>
    <feature policy='disable' name='tsc_adjust'/>
    <feature policy='disable' name='xsaveopt'/>
    <feature policy='disable' name='abm'/>
    <feature policy='disable' name='aes'/>
    <feature policy='disable' name='invpcid'/>
  </cpu>

Basically, I am disabling every single feature that my L1's "virsh
capabilities" reports. Now this does not make my L1 come up happily
from loadvm. But it does seem to initiate a clean reboot after loadvm,
and after that clean reboot it lives happily.

If this is as good as it gets (for now), then I can totally live with
that. It certainly beats running the L2 guest with Qemu (without KVM
acceleration). But I would still love to understand the issue a little
bit better.

Cheers,
Florian
David Hildenbrand
2018-Feb-08 13:47 UTC
Re: [libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
> Sure, I do understand that Red Hat (or any other vendor) is taking no
> support responsibility for this. At this point I'd just like to
> contribute to a better understanding of what's expected to definitely
> _not_ work, so that people don't bloody their noses on that. :)

Indeed. Nesting is nice to enable as it works in 99% of all cases. It
just doesn't work when trying to migrate a nested hypervisor. (on x86)

That's what most people don't realize, as it works "just fine" for 99%
of all use cases.

[...]

>> savevm/loadvm is not expected to work correctly on an L1 if it is
>> running L2 guests. It should work on L2 however.
>
> Again, I'm somewhat struggling to understand this vs. live migration —
> but it's entirely possible that I'm sorely lacking in my knowledge of
> kernel and CPU internals.

(savevm/loadvm is also called "migration to file")

When we migrate to a file, it really is the same migration stream. You
"dump" the VM state into a file, instead of sending it over to another
(running) target.

Once you load your VM state from that file, it is a completely fresh
VM/KVM environment. So you have to restore all the state. Now, as nVMX
state is not contained in the migration stream, you cannot restore that
state. The L1 state is therefore "damaged" or incomplete.
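(A rough sketch of what that looks like when driving the QEMU monitor
directly; libvirt's save/managedsave wraps the same mechanism, and the
file name here is only an example:)

  (qemu) stop
  (qemu) migrate "exec:gzip -c > /tmp/l1-state.gz"

  # Resuming means starting a completely fresh QEMU process and
  # replaying that stream into it:
  #   qemu-system-x86_64 ... -incoming "exec:gzip -dc /tmp/l1-state.gz"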
[...]

>>> Kashyap, can you think of any other limitations that would benefit
>>> from improved documentation?
>>
>> We should certainly document what I have summarized here properly at a
>> central place!
>
> I tried getting registered on the linux-kvm.org wiki to do exactly
> that, and ran into an SMTP/DNS configuration issue with the
> verification email. Kashyap said he was going to poke the site admin
> about that.
>
> Now, here's a bit more information on my continued testing. As I
> mentioned on IRC, one of the things that struck me as odd was that if
> I ran into the issue previously described, the L1 guest would enter a
> reboot loop if configured with kernel.panic_on_oops=1. In other words,
> I would savevm the L1 guest (with a running L2), then loadvm it, and
> then the L1 would stack-trace, reboot, and then keep doing that
> indefinitely. I found that weird because on the second reboot, I would
> expect the system to come up cleanly.

I guess the L1 state (in the kernel) is broken so badly that even a
reset cannot fix it.

> I've now changed my L2 guest's CPU configuration so that libvirt (in
> L1) starts the L2 guest with the following settings:
>
>   <cpu>
>     <model fallback='forbid'>Haswell-noTSX</model>
>     <vendor>Intel</vendor>
>     <feature policy='disable' name='vme'/>
>     <feature policy='disable' name='ss'/>
>     <feature policy='disable' name='f16c'/>
>     <feature policy='disable' name='rdrand'/>
>     <feature policy='disable' name='hypervisor'/>
>     <feature policy='disable' name='arat'/>
>     <feature policy='disable' name='tsc_adjust'/>
>     <feature policy='disable' name='xsaveopt'/>
>     <feature policy='disable' name='abm'/>
>     <feature policy='disable' name='aes'/>
>     <feature policy='disable' name='invpcid'/>
>   </cpu>

Maybe one of these features is the root cause of the "messed up" state
in KVM, so disabling it also makes the L1 state "less broken".

> Basically, I am disabling every single feature that my L1's "virsh
> capabilities" reports. Now this does not make my L1 come up happily
> from loadvm. But it does seem to initiate a clean reboot after loadvm,
> and after that clean reboot it lives happily.
>
> If this is as good as it gets (for now), then I can totally live with
> that. It certainly beats running the L2 guest with Qemu (without KVM
> acceleration). But I would still love to understand the issue a little
> bit better.

I mean the real solution to the problem is of course restoring the L1
state correctly (migrating nVMX state, which people are working on right
now). So what you are seeing is a bad "side effect" of that.

For now, nested=true should never be used along with savevm/loadvm/live
migration.

> Cheers,
> Florian

--
Thanks,

David / dhildenb
Kashyap Chamarthy
2018-Feb-08 14:45 UTC
Re: [libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
On Thu, Feb 08, 2018 at 01:07:33PM +0100, David Hildenbrand wrote:

[...]

> > So to clarify things, could you enumerate the currently known
> > limitations when enabling nesting? I'd be happy to summarize those and
> > add them to the linux-kvm.org FAQ so others are less likely to hit
> > their head on this issue. In particular:

[...] # Snip description of what works in context of migration

> > - Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
> > still accurate in that -cpu host (libvirt "host-passthrough") is the
> > strongly recommended configuration for the L2 guest?

That wiki is a bit outdated. And it is not accurate — if we can just
expose the Intel 'vmx' (or AMD 'svm') CPU feature flag to the L2 guest,
that should be sufficient. No need for a full passthrough.

The above document should definitely be modified to add more verbiage
comparing 'host-passthrough' vs. 'host-model' vs. custom CPU.

> > - If so, are there any recommendations for how to configure the L1
> > guest with regard to CPU model?
>
> You have to indicate the VMX feature to your L1 ("nested hypervisor"),
> that is usually automatically done by using the "host-passthrough" or
> "host-model" value. If you're using a custom CPU model, you have to
> enable it explicitly.
>
> > - Is live migration with nested guests _always_ expected to break on
> > all architectures, and if not, which are safe?
>
> x86 VMX: running nested guests works, migrating nested hypervisors does
> not work
>
> x86 SVM: running nested guests works, migrating nested hypervisors does
> not work (somebody correct me if I'm wrong)
>
> s390x: running nested guests works, migrating nested hypervisors works
>
> power: running nested guests works only via KVM-PR ("trap and emulate").
> Migrating nested hypervisors therefore works. But we are not using
> hardware virtualization for L1->L2. (my latest status)
>
> arm: running nested guests is in the works (my latest status), migration
> is therefore also not possible.

That's a great summary.

> > - Idem, for savevm/loadvm?
>
> savevm/loadvm is not expected to work correctly on an L1 if it is
> running L2 guests. It should work on L2 however.

Yes, that works as intended.

> > - With regard to the problem that Kashyap and I (and Dennis, the
> > kernel.org bugzilla reporter) are describing, is this expected to work
> > any better on AMD CPUs? (All reports are on Intel)
>
> No, remember that they are also still missing migration support of the
> nested SVM state.

Right. I partly mixed up migration of L1-running-L2 (which doesn't fly
for reasons David already explained) vs. migrating L2 (which works).

> > - Do you expect nested virtualization functionality to be adversely
> > affected by KPTI and/or other Meltdown/Spectre mitigation patches?
>
> Not an expert on this. I think it should be affected in a similar way as
> ordinary guests :)
>
> > Kashyap, can you think of any other limitations that would benefit
> > from improved documentation?
>
> We should certainly document what I have summarized here properly at a
> central place!

Yeah, agreed. Also, when documenting things in the context of nesting,
it'd be useful to explicitly spell out what works or doesn't work at
each level — e.g. L2 can be migrated to a destination L1 just fine;
migrating an L1-running-L2 to a destination L0 will be in dodgy waters
for reasons X, etc.

[...]

--
/kashyap
Florian Haas
2018-Feb-08 17:44 UTC
Re: [libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
On Thu, Feb 8, 2018 at 1:07 PM, David Hildenbrand <david@redhat.com> wrote:
> We should certainly document what I have summarized here properly at a
> central place!

Please review the three edits I've submitted to the wiki:
https://www.linux-kvm.org/page/Special:Contributions/Fghaas

Feel free to ruthlessly edit/roll back anything that is inaccurate.
Thanks!

Cheers,
Florian
Kashyap Chamarthy
2018-Feb-09 10:48 UTC
Re: [libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
On Thu, Feb 08, 2018 at 06:44:43PM +0100, Florian Haas wrote:

> On Thu, Feb 8, 2018 at 1:07 PM, David Hildenbrand <david@redhat.com> wrote:
> > We should certainly document what I have summarized here properly at a
> > central place!
>
> Please review the three edits I've submitted to the wiki:
> https://www.linux-kvm.org/page/Special:Contributions/Fghaas
>
> Feel free to ruthlessly edit/roll back anything that is inaccurate.
> Thanks!

I've made some minor edits to clarify a bunch of bits, and added a link
to the kernel documentation about Intel nVMX. (Hope that looks fine.)

You wrote: "L2...which does no further virtualization". Not quite true —
"under the right circumstances" (read: a sufficiently huge machine with
tons of RAM), L2 _can_ in turn run L3. :-)

Last time I checked (this morning), Rich W.M. Jones had tested 4 levels
of nesting with the 'supernested' program[1] he wrote. (Related aside:
this program is packaged as part of the 2016 QEMU Advent Calendar[2] --
if you want to play around on a powerful test machine with tons of free
memory.)

[1] http://git.annexia.org/?p=supernested.git;a=blob;f=README
[2] http://www.qemu-advent-calendar.org/2016/#day-13

--
/kashyap