Austin S Hemmelgarn
2013-Oct-31 14:00 UTC
[BUG] mm locking order violation when HVM guest changes graphics mode on virtual graphics adapter.
Reliably reproducible: occurs when an HVM guest changes the graphics mode on the virtual graphics adapter under Xen 4.3.0 from Gentoo.

To reproduce, using Xen 4.3.0 from the Gentoo Portage tree and the corresponding version of xl, both built with GCC 4.7.3 with HVM and qemu-dm support:
1. Boot a Gentoo Linux dom0 with kernel version 3.10.7-r1, built with the kernel config found at http://pastebin.com/GxDpPsk3.
2. Get a copy of the Fedora i686 network install CD.
3. Start an HVM domain with a configuration like the one found at http://pastebin.com/p0wxnaTg.
4. After connecting to the VNC console, start the install process.
5. When Anaconda tries to start the graphical environment, causing the kernel to change the graphics mode from the current setting, Xen will crash with a call to BUG() in mm.h at line 118.

Xen log: http://pastebin.com/zKCJsp21
xl info output: http://pastebin.com/NqtksS18
lspci -vvv output: http://pastebin.com/Ja97Cx42
xenstore contents: http://pastebin.com/aL9vpxwu

I'll be happy to provide any other information you may need upon request.
Andrew Cooper
2013-Oct-31 16:04 UTC
Re: [BUG] mm locking order violation when HVM guest changes graphics mode on virtual graphics adapter.
On 31/10/13 14:00, Austin S Hemmelgarn wrote:
> Reliably reproducible: occurs when an HVM guest changes the graphics
> mode on the virtual graphics adapter under Xen 4.3.0 from Gentoo.
> [...]
> I'll be happy to provide any other information you may need upon request.

Are you able to build Xen yourself? If so, could you try running with a debug build of Xen?

~Andrew
Andres Lagar-Cavilla
2013-Oct-31 16:27 UTC
Re: [BUG] mm locking order violation when HVM guest changes graphics mode on virtual graphics adapter.
> Reliably reproducible: occurs when an HVM guest changes the graphics
> mode on the virtual graphics adapter under Xen 4.3.0 from Gentoo.
> [...]
> I'll be happy to provide any other information you may need upon request.

Thanks for the report.

From what I can glean you are using AMD NPT; can you confirm?

So the trigger is that you are using both PoD and nested virt. To elaborate:
- Setting maxmem to 2G and memory to 512M uses PoD (the populate-on-demand subsystem) to account for the 1.5GB of extra wiggle room. Please make sure you have a guest balloon driver that will be able to deal with the guest trying to use over 512M.
- You have nestedhvm=1. Do you really need this?

Changing either (memory == maxmem, or nestedhvm=0) will remove the problem and allow you to make progress.

There is a real bug, however, that needs to be fixed here. At some point in the 4.3 cycle the flushing of the nested p2m table was added, and it would seem to be relinquishing the p2m lock:

__get_gfn_type_access   -> grabs p2m lock
p2m_pod_demand_populate -> grabs pod lock
p2m_next_level          -> still holding p2m lock, then drops it
p2m_flush_table         -> grabs p2m lock -> KAPOW

Andres
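For readers not steeped in Xen's mm locking: the BUG() here fires from a lock-ordering check, not from a deadlock. Below is a minimal, standalone C sketch of that style of check; the rank values, names, and the single global level are illustrative only and are not Xen's actual mm-locks.h code (which tracks the level per-CPU and supports recursive locks).

#include <stdio.h>
#include <stdlib.h>

#define BUG() do { fprintf(stderr, "BUG: mm lock order violation\n"); abort(); } while (0)

/* Per-CPU in real Xen; a single global is enough for this sketch. */
static int mm_lock_level;

static void mm_lock(int level)
{
    /* mm locks must be taken in increasing rank order; taking a
     * lower-ranked lock while a higher-ranked one is held is a bug. */
    if ( mm_lock_level > level )
        BUG();
    mm_lock_level = level;
}

int main(void)
{
    /* Hypothetical ranks: the pod lock ranks after the host p2m lock. */
    enum { P2M_RANK = 4, POD_RANK = 8 };

    mm_lock(P2M_RANK); /* __get_gfn_type_access: host p2m lock */
    mm_lock(POD_RANK); /* p2m_pod_demand_populate: pod lock */
    mm_lock(P2M_RANK); /* p2m_flush_table: nested p2m lock -> KAPOW */
    return 0;
}

Running this aborts on the third mm_lock() call, which is exactly the shape of the trace above: a p2m-ranked lock taken while the higher-ranked pod lock is still held.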
Austin S Hemmelgarn
2013-Oct-31 17:08 UTC
Re: [BUG] mm locking order violation when HVM guest changes graphics mode on virtual graphics adapter.
On 2013-10-31 12:27, Andres Lagar-Cavilla wrote:
> Thanks for the report.
> [...]
> Changing either (memory == maxmem, or nestedhvm=0) will remove the
> problem and allow you to make progress.
> [...]

Thanks for the quick response. I probably don't need nested HVM; I've just gotten into the habit of enabling it because many people who use the system like to use VirtualBox, which doesn't work well without it (although it still works much better than trying to run it on a PV domain). As for the memory balloon, I probably don't need that either; it was mostly to try to improve domain creation speed, because the domain gets started during boot.

Tried disabling both, and things work. Thanks again.
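For concreteness, the two workarounds map onto the xl guest config (cf. the pastebin config linked in the report) roughly as follows. The values are illustrative, taken from the 2G/512M figures discussed in the thread, and the two options are alternatives, not a combination:

# Option A: make memory equal to maxmem, so PoD is never used.
memory = 2048
maxmem = 2048

# Option B (alternative): keep PoD but turn off nested HVM.
nestedhvm = 0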
Andres Lagar-Cavilla
2013-Nov-01 16:05 UTC
Re: [BUG] mm locking order violation when HVM guest changes graphics mode on virtual graphics adapter.
On Oct 31, 2013, at 12:27 PM, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:
> [...]
> There is a real bug, however, that needs to be fixed here. At some
> point in the 4.3 cycle the flushing of the nested p2m table was added,
> and it would seem to be relinquishing the p2m lock:

George, Tim,

Paging you in for a bit more insight. The bug is as follows:
1. PoD allocates a zero page.
2. An intermediate level in the p2m needs to be allocated.
3. Thus the nested p2m needs to be flushed.
4. This attempts to grab the p2m lock on the nested p2m.
5. Explode.

This is because the current locking level is the pod lock, which exceeds the p2m-level lock. Some solutions:
1. Defer flushing of the nested p2m until we are done with the fault and have unrolled enough stack.
2. Have the nested p2m locks look "different" to the lock-ordering machinery.

I think 2 is tricky because there are paths in which there is no reason for them to look any different, like the regular hap nested fault handler (nestedhvm_hap_nested_page_fault). Really the only place where this blows up is in the flushing of the nested tables (all one of them?!).

So I propose deferring, or some other cunning idea.

Thanks,
Andres

> __get_gfn_type_access   -> grabs p2m lock
> p2m_pod_demand_populate -> grabs pod lock
> p2m_next_level          -> still holding p2m lock, then drops it
> p2m_flush_table         -> grabs p2m lock -> KAPOW
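A minimal sketch of what the deferral in option 1 could look like, shape only: the struct, field, and function names here are hypothetical, not Xen's actual code, and (as Tim's reply below explains) naive deferral has safety problems of its own.

#include <stdbool.h>

/* Hypothetical per-domain state; not Xen's real struct domain. */
struct domain_sketch {
    bool nested_p2m_flush_pending;
};

/* Where p2m_flush_nested_p2m() is invoked today (possibly under the
 * pod lock), only record that a flush is needed. */
static void defer_nested_p2m_flush(struct domain_sketch *d)
{
    d->nested_p2m_flush_pending = true;
}

/* Run at the tail of the fault path, after the pod and p2m locks have
 * been dropped, where taking the nested p2m lock is order-safe. */
static void run_deferred_nested_p2m_flush(struct domain_sketch *d)
{
    if ( d->nested_p2m_flush_pending )
    {
        d->nested_p2m_flush_pending = false;
        /* p2m_flush_nested_p2m(d) would be called here. */
    }
}

int main(void)
{
    struct domain_sketch d = { false };
    defer_nested_p2m_flush(&d);        /* today: under the pod lock */
    run_deferred_nested_p2m_flush(&d); /* later: all mm locks dropped */
    return 0;
}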
Tim Deegan
2013-Nov-02 23:12 UTC
Re: [BUG] mm locking order violation when HVM guest changes graphics mode on virtual graphics adapter.
At 12:05 -0400 on 01 Nov (1383303953), Andres Lagar-Cavilla wrote:
> Paging you in for a bit more insight. The bug is as follows:
> 1. PoD allocates a zero page.
> 2. An intermediate level in the p2m needs to be allocated.
> 3. Thus the nested p2m needs to be flushed.
> 4. This attempts to grab the p2m lock on the nested p2m.
> 5. Explode.

Right. I think the actual ordering is that PoD is reclaiming a zero page (and therefore removing an existing p2m entry), so:

p2m_pod_zero_check()
 -> p2m_set_entry() (via set_p2m_entry())
 -> hap_write_p2m_entry()
 -> p2m_flush_nested_p2m() (a tail call, so hap_write_p2m_entry() isn't on the stack)
 -> p2m_lock() :(

> This is because the current locking level is the pod lock, which
> exceeds the p2m-level lock. Some solutions:
> 1. Defer flushing of the nested p2m until we are done with the fault
>    and have unrolled enough stack.

That could be a bit intricate, and I think it's probably not safe. What we're doing here is reclaiming a zero page so we can use it to serve a PoD fault. That means we're going to map that page, guest-writeable, at a new GFN before we release the PoD lock, so we _must_ be able to flush the old mapping without dropping the PoD lock (unless we want to add an unlock-and-retry loop in the PoD handler...).

I guess we could avoid that by always keeping at least one PoD page free per vcpu -- then, I think, we could defer this flush until after the PoD unlock, since all other changes the PoD code makes are safe not to flush (replacing not-present entries). But even then we have a race window, after we pod-unlock and before we finish flushing, where another CPU could pod-lock and map the page.

> 2. Have the nested p2m locks look "different" to the lock-ordering
>    machinery.

Also unpalatable. But yes, we could invent a special case to make p2m_lock()ing a nested p2m into its own thing (ranked below PoD in mm-locks.h, and safe because we never do PoD or paging ops on nested p2ms).

I think that (1) would be nice to have, since deferring/batching these nested flushes is probably a good idea in other cases. But (2) will be much easier to implement.

Cheers,

Tim.
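Continuing the toy rank model sketched earlier in the thread, option 2 boils down to giving locks on nested p2ms their own rank that orders after the pod lock ("ranked below PoD in mm-locks.h" means later in that file's ordering, i.e. permitted to be taken while the pod lock is held). Everything below is illustrative: the is_nested flag, the rank values, and the function names are assumptions for the sketch, not Xen's real code.

#include <stdbool.h>

/* Illustrative ranks only. The nested p2m rank orders after pod, so
 * flushing a nested p2m under the pod lock passes the ordering check,
 * while host p2m locking keeps its old, stricter rank. */
enum mm_lock_rank { P2M_RANK = 4, POD_RANK = 8, NESTED_P2M_RANK = 12 };

struct p2m_sketch {
    bool is_nested; /* hypothetical flag distinguishing nested p2ms */
};

static int mm_lock_level; /* per-CPU in real Xen */

static void mm_lock(int level)
{
    if ( mm_lock_level > level )
        __builtin_trap(); /* stands in for BUG() */
    mm_lock_level = level;
}

static void p2m_lock_sketch(struct p2m_sketch *p2m)
{
    /* Safe only because, as noted above, PoD and paging ops are never
     * performed on nested p2ms. */
    mm_lock(p2m->is_nested ? NESTED_P2M_RANK : P2M_RANK);
}

int main(void)
{
    struct p2m_sketch nested = { true };

    mm_lock(POD_RANK);        /* pod lock held during zero-page reclaim */
    p2m_lock_sketch(&nested); /* nested flush now passes the rank check */
    return 0;
}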