In (3.0.4-based) SLE10 SP1 we are currently dealing with a (reproducible) report of time getting screwed up during domain shutdown. Debugging revealed that the PM timer misses at least one overflow (i.e. platform time lost about 4 seconds), which subsequently leads to disastrous effects.

Apart from tracking the time calibration, as the (currently) last step of narrowing down the cause I made the first processor that detects a severe anomaly in time flow send an IPI to CPU0 (which is exclusively responsible for managing platform time). This appears to prove that CPU0 is indeed busy processing a domain_kill() request, and specifically is in the middle of tearing down the address spaces of the guest.

Obviously, the hypervisor's behavior should not depend on the amount of time needed to free a dead domain's resources. But as the code is written, domain shutdown is executed synchronously on the CPU requesting it (from some code comparison I would conclude that, while the code has changed significantly, this basic characteristic hasn't - of course, history shows that I may easily be overlooking something here), and if that CPU happens to be CPU0, the whole system suffers due to the asymmetry of platform time handling.

If I'm indeed not overlooking an important fix in that area, what would be considered a reasonable solution to this? I can imagine (in order of my preference):

- inserting calls to do_softirq() in the put_page_and_type() call hierarchy (e.g. in alloc_l2_table() or even alloc_l1_table(), to guarantee uniform behavior across sub-architectures; this might help address other issues, as the same scenario might happen when a page table hierarchy gets destroyed at times other than domain shutdown); perhaps the same might then also be needed in the get_page_type() hierarchy, e.g.
in free_l{2,1}_table()

- simply doing round-robin responsibility of platform time among all CPUs (would leave the unlikely UP case as still affected by the problem)

- detecting platform timer overflow (and properly estimating how many times it has overflowed) and sync-ing platform time back from local time (as indicated in a comment somewhere)

- marshalling the whole operation to another CPU

For reference, this is the CPU0 backtrace I'm getting from the IPI:

(XEN) *** Dumping CPU0 host state: ***
(XEN) State at keyhandler.c:109
(XEN) ----[ Xen-3.0.4_13138-0.63 x86_64 debug=n Not tainted ]----
(XEN) CPU: 0
(XEN) RIP: e008:[<ffff83000010e8a2>] dump_execstate+0x62/0xe0
(XEN) RFLAGS: 0000000000010246 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: 0000000000000000 rcx: 000000000013dd62
(XEN) rdx: 000000000000000a rsi: 000000000000000a rdi: ffff8300002b2142
(XEN) rbp: 0000000000000000 rsp: ffff8300001d3a30 r8: 0000000000000001
(XEN) r9: 0000000000000001 r10: 00000000fffffffc r11: 0000000000000001
(XEN) r12: 0000000000000001 r13: 0000000000000001 r14: 0000000000000001
(XEN) r15: cccccccccccccccd cr0: 0000000080050033 cr4: 00000000000006f0
(XEN) cr3: 000000000ce02000 cr2: 00002b47f8871ca8
(XEN) ds: 0000 es: 0000 fs: 0063 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff8300001d3a30:
(XEN)    0000000000000046 ffff830000f7e280 ffff8300002b0e00 ffff830000f7e280
(XEN)    ffff83000013b665 0000000000000000 ffff83000012dc8a cccccccccccccccd
(XEN)    0000000000000001 0000000000000001 0000000000000001 ffff830000f7e280
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    ffff8284008f7aa0 ffff8284008f7ac8 0000000000000000 0000000000000000
(XEN)    0000000000039644 ffff8284008f7aa0 000000fb00000000 ffff83000011345d
(XEN)    000000000000e008 0000000000000246 ffff8300001d3b18 000000000000e010
(XEN)    ffff830000113348 ffff83000013327f 0000000000000000 ffff8284008f7aa0
(XEN)    ffff8307cc1b7288 ffff8307cc1b8000 ffff830000f7e280 00000000007cc315
(XEN)    ffff8284137e4498 ffff830000f7e280 ffff830000132c24 0000000020000001
(XEN)    0000000020000000 ffff8284137e4498 00000000007cc315 ffff8284137e7b48
(XEN)    ffff830000132ec4 ffff8284137e4498 000000000000015d ffff830000f7e280
(XEN)    ffff8300001328d2 ffff8307cc315ae8 ffff830000132cbb 0000000040000001
(XEN)    0000000040000000 ffff8284137e7b48 ffff830000f7e280 ffff8284137f6be8
(XEN)    ffff830000132ec4 ffff8284137e7b48 00000000007cc919 ffff8307cc91a000
(XEN)    ffff8300001331a2 ffff8307cc919018 ffff830000132d41 0000000060000001
(XEN)    0000000060000000 ffff8284137f6be8 0000000000006ea6 ffff8284001149f0
(XEN)    ffff830000132ec4 ffff8284137f6be8 0000000000000110 ffff830000f7e280
(XEN)    ffff830000133132 ffff830006ea6880 ffff830000132df0 0000000080000001
(XEN)    0000000080000000 ffff8284001149f0 ffff8284001149f0 ffff8284001149f0
(XEN) Xen call trace:
(XEN)    [<ffff83000010e8a2>] dump_execstate+0x62/0xe0
(XEN)    [<ffff83000013b665>] smp_call_function_interrupt+0x55/0xa0
(XEN)    [<ffff83000012dc8a>] call_function_interrupt+0x2a/0x30
(XEN)    [<ffff83000011345d>] free_domheap_pages+0x2bd/0x3b0
(XEN)    [<ffff830000113348>] free_domheap_pages+0x1a8/0x3b0
(XEN)    [<ffff83000013327f>] put_page_from_l1e+0x9f/0x120
(XEN)    [<ffff830000132c24>] free_page_type+0x314/0x540
(XEN)    [<ffff830000132ec4>] put_page_type+0x74/0xf0
(XEN)    [<ffff8300001328d2>] put_page_from_l2e+0x32/0x70
(XEN)    [<ffff830000132cbb>] free_page_type+0x3ab/0x540
(XEN)    [<ffff830000132ec4>] put_page_type+0x74/0xf0
(XEN)    [<ffff8300001331a2>] put_page_from_l3e+0x32/0x70
(XEN)    [<ffff830000132d41>] free_page_type+0x431/0x540
(XEN)    [<ffff830000132ec4>] put_page_type+0x74/0xf0
(XEN)    [<ffff830000133132>] put_page_from_l4e+0x32/0x70
(XEN)    [<ffff830000132df0>] free_page_type+0x4e0/0x540
(XEN)    [<ffff830000132ec4>] put_page_type+0x74/0xf0
(XEN)    [<ffff83000012923a>] relinquish_memory+0x17a/0x290
(XEN)    [<ffff830000183665>] identify_cpu+0x5/0x1f0
(XEN)    [<ffff830000117f10>] vcpu_runstate_get+0xb0/0xf0
(XEN)    [<ffff8300001296aa>] domain_relinquish_resources+0x35a/0x3b0
(XEN)    [<ffff8300001083e8>] domain_kill+0x28/0x60
(XEN)    [<ffff830000107560>] do_domctl+0x690/0xe60
(XEN)    [<ffff830000121def>] __putstr+0x1f/0x70
(XEN)    [<ffff830000138016>] mod_l1_entry+0x636/0x670
(XEN)    [<ffff830000118143>] schedule+0x1f3/0x270
(XEN)    [<ffff830000175ca6>] toggle_guest_mode+0x126/0x140
(XEN)    [<ffff830000175fa8>] do_iret+0xa8/0x1c0
(XEN)    [<ffff830000173b32>] syscall_enter+0x62/0x67

Jan
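As an aside, the "about 4 seconds" of lost platform time is consistent with a single missed wrap of the ACPI PM timer, assuming the usual 24-bit counter running at 3.579545 MHz:

$$ t_{\mathrm{wrap}} = \frac{2^{24}\ \text{ticks}}{3\,579\,545\ \text{ticks/s}} \approx 4.69\ \text{s} $$

So any code path that keeps CPU0 from sampling the counter for longer than roughly 4.7 seconds can silently drop a full wrap's worth of platform time.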
This was addressed by xen-unstable:15821. The fix is present in releases since 3.2.0. It was never backported to 3.1 branch.

There are a few changesets related to 15821 that you would also want to take into your tree. For example, 15838 is a bugfix. And there is also a change on the tools side that is required because domain_destroy can now return -EAGAIN if it gets preempted. Any others will probably become obvious when you try to backport 15821.

 -- Keir

On 28/4/08 14:45, "Jan Beulich" <jbeulich@novell.com> wrote:

> In (3.0.4-based) SLE10 SP1 we are currently dealing with a (reproducible)
> report of time getting screwed up during domain shutdown. [...]
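Keir's remark about the tools side implies a retry loop around the destroy operation. The sketch below only illustrates that pattern - the wrapper name is invented, and whether the loop belongs in libxc or in its caller (and the exact error convention of xc_domain_destroy()) is an assumption here, not the content of the actual changesets. The hypervisor-side counterpart is the hypercall_preempt_check() in relinquish_memory() that Jan identifies in the next message.

```c
/*
 * Sketch only: retry a domain-destroy request that may now be preempted
 * and report EAGAIN.  destroy_domain_retrying() is a hypothetical helper;
 * the real tools-side change may look quite different.
 */
#include <errno.h>
#include <stdint.h>
#include <xenctrl.h>

int destroy_domain_retrying(int xc_handle, uint32_t domid)
{
    int rc;

    do {
        rc = xc_domain_destroy(xc_handle, domid);
    } while ( rc < 0 && errno == EAGAIN );  /* preempted - just try again */

    return rc;
}
```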
>>> Keir Fraser <keir.fraser@eu.citrix.com> 28.04.08 15:59 >>>
>This was addressed by xen-unstable:15821. The fix is present in releases
>since 3.2.0. It was never backported to 3.1 branch.
>
>There are a few changesets related to 15821 that you would also want to take
>into your tree. For example, 15838 is a bugfix. And there is also a change
>on the tools side that is required because domain_destroy can now return
>-EAGAIN if it gets preempted. Any others will probably become obvious when
>you try to backport 15821.
>
> -- Keir

Okay, thanks - so I indeed missed the call to hypercall_preempt_check() in relinquish_memory(), which is the key indicator here.

However, that change deals exclusively with domain shutdown, but not with the more general page table pinning/unpinning operations, which I believe are (as described) vulnerable to mis-use by a malicious guest (I realize that well behaved guests would not normally present a heavily populated address space here, but it also cannot be entirely excluded) - the upper bound to the number of operations on x86-64 is 512**4 or 2**36 l1 table entries (ignoring the hypervisor hole which doesn't need processing).

Jan
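To put that bound in perspective:

$$ 512^4 = \left(2^9\right)^4 = 2^{36} \approx 6.9\times10^{10}\ \text{L1 entries} $$

Even at an assumed cost of only one nanosecond per entry, a single fully populated pin/unpin operation could in the theoretical worst case keep a CPU busy for roughly $2^{36}\,\mathrm{ns} \approx 69$ seconds without a preemption point.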
On 28/4/08 15:30, "Jan Beulich" <jbeulich@novell.com> wrote:

> However, that change deals exclusively with domain shutdown, but not
> with the more general page table pinning/unpinning operations, which I
> believe are (as described) vulnerable to mis-use by a malicious guest (I
> realize that well behaved guests would not normally present a heavily
> populated address space here, but it also cannot be entirely excluded)
> - the upper bound to the number of operations on x86-64 is 512**4
> or 2**36 l1 table entries (ignoring the hypervisor hole which doesn't
> need processing).

True. It turns out to be good enough in practice though.

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 30.04.08 16:26 >>>
>On 30/4/08 15:00, "Jan Beulich" <jbeulich@novell.com> wrote:
>>
>> According to two forced backtraces with about a second delta, the
>> hypervisor is in the process of releasing the 1:1 mapping of the
>> guest kernel and managed, during that one second, to increment
>> i in free_l3_table() by just 1. This would make up for unbelievable
>> 13,600 clocks per l1 entry being freed.
>
>That's not great. :-) At such a high cost, perhaps some tracing might
>indicate if we are taking some stupid slow path in free_domheap_page() or
>cleanup_page_cacheattr()? I very much hope that 13600 cycles cannot be
>legitimately accounted for!

I'm afraid it's really that bad. I used another (local to my office) machine, and the numbers aren't exactly as bad as on the box they were originally measured on. But after collecting the cumulative clock cycles spent in free_l1_table() and free_domheap_pages() (and their descendants, so the former obviously includes a large part of the latter) during the largest single run of relinquish_memory(), I'm getting an average of 3,400 clocks spent in free_domheap_pages() (with all but very few pages going onto the scrub list) and 8,500 clocks spent per page table entry (assuming all entries are populated, so the real number is higher) in free_l1_table(). It's the relationship between the two numbers that makes me believe that there's really this much time spent on it.

For the specific case of cleaning up after a domain, there seems to be a pretty simple workaround, though: free_l{3,4}_table() can simply avoid recursing into put_page_from_l{3,4}e() by checking for d->arch.relmem being RELMEM_dom_l{3,4}. This, as expected, reduces the latency of preempting relinquish_memory() (for a 5G domU) on the box I tested from about 3s to less than half a second - if that's considered still too much, the same kind of check could of course be added to free_l2_table().

But as there's no similarly simple mechanism to deal with the DoS potential in pinning/unpinning or installing L4 (and maybe L3) table entries, there'll need to be a way to preempt these call trees anyway. Since hypercalls cannot nest, storing respective state in the vcpu structure shouldn't be a problem, but what I'm unsure about is what side effects a partially validated page table might introduce.

While looking at this I wondered whether there really is a way for Xen heap pages to end up being guest page tables (or similarly descriptor table ones)? I would think if that happened this would be a bug (and perhaps a security issue). If it cannot happen, then the RELMEM_* states could be simplified and domain_relinquish_resources() shortened.

(I was traveling, so it took a while to get to do the measurements.)

Jan
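Jan doesn't show how these cumulative cycle counts were collected; the sketch below illustrates one plausible way to gather such numbers with the TSC. All names here are hypothetical - this is not the instrumentation actually used:

```c
/*
 * Hypothetical sketch of TSC-based cycle accounting of the kind that could
 * produce cumulative per-function totals like those quoted above.
 * Not actual Xen code; names are invented for illustration.
 */
#include <stdint.h>

static inline uint64_t tsc_read(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ( "rdtsc" : "=a" (lo), "=d" (hi) );
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t free_l1_cycles;   /* dumped later, e.g. via a debug key */

static void instrumented_free_l1_table(void *page)
{
    uint64_t t0 = tsc_read();

    (void)page;                   /* ... body of free_l1_table() here ... */

    free_l1_cycles += tsc_read() - t0;
}
```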
On 8/5/08 10:58, "Jan Beulich" <jbeulich@novell.com> wrote:

> While looking at this I wondered whether there really is a way for
> Xen heap pages to end up being guest page tables (or similarly
> descriptor table ones)? I would think if that happened this would be
> a bug (and perhaps a security issue). If it cannot happen, then the
> RELMEM_* states could be simplified and
> domain_relinquish_resources() shortened.

You mean just force page-table type counts to zero, and drop main reference count by the same amount? Might work. Would need some thought.

8500 cycles per pte is pretty fantastic. I suppose a few atomic ops are involved. Are you running on an old P4? :-)

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 08.05.08 12:12 >>>
>On 8/5/08 10:58, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> While looking at this I wondered whether there really is a way for
>> Xen heap pages to end up being guest page tables (or similarly
>> descriptor table ones)? I would think if that happened this would be
>> a bug (and perhaps a security issue). If it cannot happen, then the
>> RELMEM_* states could be simplified and
>> domain_relinquish_resources() shortened.
>
>You mean just force page-table type counts to zero, and drop main reference
>count by the same amount? Might work. Would need some thought.

No, here I mean having just RELMEM_xen and RELMEM_l{1,2,3,4}. Then simply release Xen pages first, then l4...l1.

For the suggested workaround to reduce latency of relinquish_memory() preemption, I simply mean utilizing the code to deal with circular references also for releasing simple ones (that code path doesn't seem to care to force the type count to zero, but as I understand that's no problem since these pages end up being freed anyway, and that's where the whole type_info field gets re-initialized - or was this happening when the page gets allocated the next time).

>8500 cycles per pte is pretty fantastic. I suppose a few atomic ops are
>involved. Are you running on an old P4? :-)

Not too old, it's what they called Tulsa as codename (i.e. some of the about two year old Xeons). But I suppose that generally the bigger the box (in terms of number of threads/cores/sockets), the higher the price for atomic ops.

In trying to get a picture, I e.g. measured both the cumulative full free_domheap_pages()'s and free_l1_table()'s contributions as well as just the d != NULL sub-part of free_domheap_pages() - example results are

0x2a4990400 clocks for full free_l1_table()
0x10f1a3749 clocks for full free_domheap_pages()
0x0ec6748eb clocks for the d != NULL body

Given how little it is that happens outside of that d != NULL body, I'm concluding that the atomic ops are by far not alone responsible for the long execution time.

These are dual-threaded CPUs, however, so even though the system was doing nothing else I cannot exclude that the CPUs do something dumb when switching between the threads. But excluding this possible effect from the picture seems to have little sense, since we need to be able to deal with the situation anyway.

Jan
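For readability, the hex totals correspond roughly to:

$$
\begin{aligned}
\texttt{0x2a4990400} &\approx 1.14\times10^{10}\ \text{cycles (free\_l1\_table(), total)}\\
\texttt{0x10f1a3749} &\approx 4.55\times10^{9}\ \text{cycles (free\_domheap\_pages(), total)}\\
\texttt{0x0ec6748eb} &\approx 3.97\times10^{9}\ \text{cycles (the \texttt{d != NULL} body)}
\end{aligned}
$$

That is, free_domheap_pages() accounts for roughly 40% of the time spent under free_l1_table() - consistent with the 3,400 vs. 8,500 clock averages above - and roughly 87% of its time is inside the d != NULL body, supporting the conclusion that the atomic ops alone cannot explain the cost.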
On 8/5/08 11:39, "Jan Beulich" <jbeulich@novell.com> wrote:

> No, here I mean having just RELMEM_xen and RELMEM_l{1,2,3,4}.
> Then simply release Xen pages first, then l4...l1.
>
> For the suggested workaround to reduce latency of relinquish_memory()
> preemption, I simply mean utilizing the code to deal with circular
> references also for releasing simple ones (that code path doesn't seem
> to care to force the type count to zero, but as I understand that's no
> problem since these pages end up being freed anyway, and that's
> where the whole type_info field gets re-initialized - or was this
> happening when the page gets allocated the next time).

You've lost me. Either you are confused or I have forgotten the details of how that shutdown code works. Either is quite possible I suspect. :-) Basically I don't see how this avoids the recursive, and potentially rather expensive, teardown. Nor am I convinced about how much potential time-saving there is to be had here.

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 08.05.08 12:52 >>>
>On 8/5/08 11:39, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> No, here I mean having just RELMEM_xen and RELMEM_l{1,2,3,4}.
>> Then simply release Xen pages first, then l4...l1.
>>
>> For the suggested workaround to reduce latency of relinquish_memory()
>> preemption, I simply mean utilizing the code to deal with circular
>> references also for releasing simple ones (that code path doesn't seem
>> to care to force the type count to zero, but as I understand that's no
>> problem since these pages end up being freed anyway, and that's
>> where the whole type_info field gets re-initialized - or was this
>> happening when the page gets allocated the next time).
>
>You've lost me. Either you are confused or I have forgotten the details of
>how that shutdown code works. Either is quite possible I suspect. :-)
>Basically I don't see how this avoids the recursive, and potentially rather
>expensive, teardown.

All I mean is a change like this:

--- 2008-05-08.orig/xen/arch/x86/mm.c
+++ 2008-05-08/xen/arch/x86/mm.c
@@ -1341,6 +1341,9 @@ static void free_l3_table(struct page_in
     l3_pgentry_t *pl3e;
     int i;
 
+    if(d->arch.relmem == RELMEM_dom_l3)
+        return;
+
     pl3e = map_domain_page(pfn);
 
     for ( i = 0; i < L3_PAGETABLE_ENTRIES; i++ )
@@ -1364,6 +1367,9 @@ static void free_l4_table(struct page_in
     l4_pgentry_t *pl4e = page_to_virt(page);
     int i;
 
+    if(d->arch.relmem == RELMEM_dom_l4)
+        return;
+
     for ( i = 0; i < L4_PAGETABLE_ENTRIES; i++ )
         if ( is_guest_l4_slot(d, i) )
             put_page_from_l4e(pl4e[i], pfn);

I tried it out on SLE10 SP1 (3.0.4 derived), and it appeared to work and serve the purpose. With this, L3 and L2 tables are no longer freed recursively upon an L4/L3 one dropping its last type reference, but they rather get caught by the code that so far was only responsible for dealing with circular references. The result is that between individual full L2 tables (including the L1s hanging off of them) being released there now is a preemption check. Up until now, when the last L4 table got freed, everything hanging off of it needed to be dealt with in a single non-preemptible chunk.

>Nor am I convinced about how much potential time-saving
>there is to be had here.

I'm not seeing any time saving here. The other thing I brought up was just an unrelated item pointing out potential for code simplification.

Jan
On 8/5/08 12:13, "Jan Beulich" <jbeulich@novell.com> wrote:

>>> Nor am I convinced about how much potential time-saving
>>> there is to be had here.
>>
>> I'm not seeing any time saving here. The other thing I brought up
>> was just an unrelated item pointing out potential for code
>> simplification.

Ah, yes, I see.

The approach looks plausible. I think in its current form it will leave zombie L2/L3 pages hanging around and the domain will never actually properly die (e.g., still will be visible with the 'q' key). Because although you do get around to doing free_lX_table(), the type count and ref count of the L2/L3 pages will not drop to zero because the dead L3/L4 page never actually dropped its references properly.

In actuality, since we know that we never have 'cross-domain' pagetable type references, we should actually be able to zap pagetable reference counts to zero. The only reason we don't do that right now is really because it provides good debugging info to see whether a domain's refcounts have got screwed up. But that would not prevent us doing something faster for NDEBUG builds, at least.

Does that make sense?

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 08.05.08 14:11 >>>
>The approach looks plausible. I think in its current form it will leave
>zombie L2/L3 pages hanging around and the domain will never actually
>properly die (e.g., still will be visible with the 'q' key). Because
>although you do get around to doing free_lX_table(), the type count and ref
>count of the L2/L3 pages will not drop to zero because the dead L3/L4 page
>never actually dropped its references properly.

Hmm, indeed, I should look for this after the next run.

>In actuality, since we know that we never have 'cross-domain' pagetable type
>references, we should actually be able to zap pagetable reference counts to
>zero. The only reason we don't do that right now is really because it
>provides good debugging info to see whether a domain's refcounts have got
>screwed up. But that would not prevent us doing something faster for NDEBUG
>builds, at least.
>
>Does that make sense?

Yes, except for me not immediately seeing why this is then not also a problem for the current circular reference handling.

But really, rather than introducing (and fixing) the hack here I'd much prefer a generic solution to the problem, and you didn't say a word on the thoughts I had on that (but in a mail a couple of days ago you indicated you might get around doing something in that area yourself, so I half way implied you may have a mechanism in mind already).

Jan
On 8/5/08 13:33, "Jan Beulich" <jbeulich@novell.com> wrote:

>> In actuality, since we know that we never have 'cross-domain' pagetable type
>> references, we should actually be able to zap pagetable reference counts to
>> zero. The only reason we don't do that right now is really because it
>> provides good debugging info to see whether a domain's refcounts have got
>> screwed up. But that would not prevent us doing something faster for NDEBUG
>> builds, at least.
>>
>> Does that make sense?
>
> Yes, except for me not immediately seeing why this is then not also a
> problem for the current circular reference handling.

Because ultimately the reference(s) that are still being held on the page we are unvalidating and calling free_lX_table() on will get dropped, due to the fact we are breaking the circular chain and calling free_lX_table()->put_page_and_type()->...

> But really, rather than introducing (and fixing) the hack here I'd much
> prefer a generic solution to the problem, and you didn't say a word on
> the thoughts I had on that (but in a mail a couple of days ago you
> indicated you might get around doing something in that area yourself,
> so I half way implied you may have a mechanism in mind already).

I don't have a very clear plan, except that some kind of continuation (basically encoding of how far we got) must be encoded in the page_info structure. We should be able to find spare bits for a page which is in this in-between state.

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 08.05.08 14:36 >>>
>> But really, rather than introducing (and fixing) the hack here I'd much
>> prefer a generic solution to the problem, and you didn't say a word on
>> the thoughts I had on that (but in a mail a couple of days ago you
>> indicated you might get around doing something in that area yourself,
>> so I half way implied you may have a mechanism in mind already).
>
>I don't have a very clear plan, except that some kind of continuation
>(basically encoding of how far we got) must be encoded in the page_info
>structure. We should be able to find spare bits for a page which is in this
>in-between state.

Hmm, storing this in page_info seems questionable to me. It'd be at least 18 bits (on x86-64) that we'd need. I think this rather has to go into struct vcpu.

But what worries me more is that (obviously) any affected page will have to have its PGT_validated bit kept clear, which could lead to undesirable latencies in spin loops on other vcpus waiting for it to become set. In the worst case this could lead to deadlocks (at least in the UP case or when multiple vCPU-s of one guest are pinned to the same physical CPU) afaics. Perhaps this part could indeed be addressed with a new PGT_* bit, upon which waiters could exit their spin loops and consider themselves preempted.

Jan
On 8/5/08 15:29, "Jan Beulich" <jbeulich@novell.com> wrote:

> Hmm, storing this in page_info seems questionable to me. It'd be at
> least 18 bits (on x86-64) that we'd need. I think this rather has to go
> into struct vcpu.

We can, for example, reuse tlbflush_timestamp for this purpose. Stick it in the vcpu structure and I think we make life hard for ourselves. What if the guest does not resume the hypercall, for example? What if the guest goes and tries to execute a different hypercall instead?

> But what worries me more is that (obviously) any affected page will
> have to have its PGT_validated bit kept clear, which could lead to
> undesirable latencies in spin loops on other vcpus waiting for it to
> become set. In the worst case this could lead to deadlocks (at least
> in the UP case or when multiple vCPU-s of one guest are pinned to
> the same physical CPU) afaics. Perhaps this part could indeed be
> addressed with a new PGT_* bit, upon which waiters could exit
> their spin loops and consider themselves preempted.

Yes, the page state machine does need some more careful thought. I'm pretty sure we have enough page state bits though.

 -- Keir
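As a purely hypothetical illustration of the idea being discussed - recording how far validation or teardown of a page table got, in space borrowed from struct page_info, guarded by a not-yet-existing "partial" type flag - consider the sketch below. All names, types and bit positions are invented, and the eventual design may differ entirely:

```c
/*
 * Hypothetical sketch of encoding a validation/teardown continuation in
 * per-page state, as discussed above.  The structure, flag and helpers do
 * not match real Xen definitions; they only illustrate the idea.
 */
#include <stdint.h>

#define PGT_partial_sketch  (1u << 27)   /* invented "in-between state" flag */

struct page_info_sketch {
    uint32_t type_info;                  /* type, validation flags, count    */
    uint32_t tlbflush_timestamp;         /* reusable while partially valid   */
};

/* Record that entries [0, idx) of this page table have been processed. */
static void save_progress(struct page_info_sketch *pg, unsigned int idx)
{
    pg->tlbflush_timestamp = idx;        /* 9 bits suffice for 512 entries */
    pg->type_info |= PGT_partial_sketch;
}

/* Entry index to resume from; 0 if the page was never left half-done. */
static unsigned int resume_index(const struct page_info_sketch *pg)
{
    return (pg->type_info & PGT_partial_sketch) ? pg->tlbflush_timestamp : 0;
}
```

Reusing tlbflush_timestamp this way would only be valid while the page sits in the in-between state, which is presumably why Keir singles that field out.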
>>> Keir Fraser <keir.fraser@eu.citrix.com> 08.05.08 14:11 >>>
>On 8/5/08 12:13, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>>> Nor am I convinced about how much potential time-saving
>>> there is to be had here.
>>
>> I'm not seeing any time saving here. The other thing I brought up
>> was just an unrelated item pointing out potential for code
>> simplification.
>
>Ah, yes, I see.
>
>The approach looks plausible. I think in its current form it will leave
>zombie L2/L3 pages hanging around and the domain will never actually
>properly die (e.g., still will be visible with the 'q' key). Because
>although you do get around to doing free_lX_table(), the type count and ref
>count of the L2/L3 pages will not drop to zero because the dead L3/L4 page
>never actually dropped its references properly.

Indeed, the extended version below avoids this.

>In actuality, since we know that we never have 'cross-domain' pagetable type
>references, we should actually be able to zap pagetable reference counts to
>zero. The only reason we don't do that right now is really because it
>provides good debugging info to see whether a domain's refcounts have got
>screwed up. But that would not prevent us doing something faster for NDEBUG
>builds, at least.

I still thought it'd be better to not simply zap the counts, but incrementally drop them using the proper interface:

Index: 2008-05-08/xen/arch/x86/domain.c
===================================================================
--- 2008-05-08.orig/xen/arch/x86/domain.c	2008-05-07 12:21:36.000000000 +0200
+++ 2008-05-08/xen/arch/x86/domain.c	2008-05-09 12:05:18.000000000 +0200
@@ -1725,6 +1725,23 @@ static int relinquish_memory(
         if ( test_and_clear_bit(_PGC_allocated, &page->count_info) )
             put_page(page);
 
+        y = page->u.inuse.type_info;
+
+        /*
+         * Forcibly drop reference counts of page tables above top most (which
+         * were skipped to prevent long latencies due to deep recursion - see
+         * the special treatment in free_lX_table()).
+         */
+        if ( type < PGT_root_page_table &&
+             unlikely(((y + PGT_type_mask) &
+                       (PGT_type_mask|PGT_validated)) == type) ) {
+            BUG_ON((y & PGT_count_mask) >= (page->count_info & PGC_count_mask));
+            while ( y & PGT_count_mask ) {
+                put_page_and_type(page);
+                y = page->u.inuse.type_info;
+            }
+        }
+
         /*
          * Forcibly invalidate top-most, still valid page tables at this point
          * to break circular 'linear page table' references. This is okay
@@ -1732,7 +1749,6 @@
          * is now dead. Thus top-most valid tables are not in use so a non-zero
          * count means circular reference.
          */
-        y = page->u.inuse.type_info;
         for ( ; ; )
         {
             x = y;
@@ -1896,6 +1912,9 @@ int domain_relinquish_resources(struct d
            /* fallthrough */

        case RELMEM_done:
+            ret = relinquish_memory(d, &d->page_list, PGT_l1_page_table);
+            if ( ret )
+                return ret;
            break;

        default:
Index: 2008-05-08/xen/arch/x86/mm.c
===================================================================
--- 2008-05-08.orig/xen/arch/x86/mm.c	2008-05-08 12:13:40.000000000 +0200
+++ 2008-05-08/xen/arch/x86/mm.c	2008-05-08 13:04:13.000000000 +0200
@@ -1341,6 +1341,9 @@ static void free_l3_table(struct page_in
     l3_pgentry_t *pl3e;
     int i;
 
+    if(d->arch.relmem == RELMEM_dom_l3)
+        return;
+
     pl3e = map_domain_page(pfn);
 
     for ( i = 0; i < L3_PAGETABLE_ENTRIES; i++ )
@@ -1364,6 +1367,9 @@ static void free_l4_table(struct page_in
     l4_pgentry_t *pl4e = page_to_virt(page);
     int i;
 
+    if(d->arch.relmem == RELMEM_dom_l4)
+        return;
+
     for ( i = 0; i < L4_PAGETABLE_ENTRIES; i++ )
         if ( is_guest_l4_slot(d, i) )
             put_page_from_l4e(pl4e[i], pfn);
On 9/5/08 11:23, "Jan Beulich" <jbeulich@novell.com> wrote:

> Indeed, the extended version below avoids this.
>
>> In actuality, since we know that we never have 'cross-domain' pagetable type
>> references, we should actually be able to zap pagetable reference counts to
>> zero. The only reason we don't do that right now is really because it
>> provides good debugging info to see whether a domain's refcounts have got
>> screwed up. But that would not prevent us doing something faster for NDEBUG
>> builds, at least.
>
> I still thought it'd be better to not simply zap the counts, but
> incrementally drop them using the proper interface:

Theoretically you can still race PIN_Lx_TABLE hypercalls from other dom0 VCPUs. Obviously that would only happen from a misbehaving dom0 though. I think this patch is a reasonable stopgap measure.

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 28.04.08 16:42 >>>
>On 28/4/08 15:30, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> However, that change deals exclusively with domain shutdown, but not
>> with the more general page table pinning/unpinning operations, which I
>> believe are (as described) vulnerable to mis-use by a malicious guest (I
>> realize that well behaved guests would not normally present a heavily
>> populated address space here, but it also cannot be entirely excluded)
>> - the upper bound to the number of operations on x86-64 is 512**4
>> or 2**36 l1 table entries (ignoring the hypervisor hole which doesn't
>> need processing).
>
>True. It turns out to be good enough in practice though.

I'm afraid that's not the case - after they are now using the domain shutdown fix successfully, they upgraded the machine to 64G and the system fails to boot. Sounds exactly like other reports we had on the list regarding boot failures with lots of memory that can be avoided using dom0_mem=<much smaller value>. As I understand it, this is due to the way the kernel creates its 1:1 mapping - the hypervisor has to validate the whole tree from each L4 entry being installed in a single step - for a 4G machine I measured half a second for this operation, so obviously anything beyond 32G is open for problems when the PM timer is in use.

Unless you tell me that this is on your very short term agenda to work on, I'll make an attempt at finding a reasonable solution starting tomorrow.

Jan
>>> "Jan Beulich" <jbeulich@novell.com> 14.05.08 17:54 >>>
>I'm afraid that's not the case - after they are now using the domain
>shutdown fix successfully, they upgraded the machine to 64G and
>the system fails to boot. Sounds exactly like other reports we had on
>the list regarding boot failures with lots of memory that can be avoided
>using dom0_mem=<much smaller value>. As I understand it, this is
>due to the way the kernel creates its 1:1 mapping - the hypervisor has
>to validate the whole tree from each L4 entry being installed in a single
>step - for a 4G machine I measured half a second for this operation, so

Sorry, I meant to write 1/8th of a second. But that's on a small (and hence memory-wise faster) machine. Didn't measure on my bigger box, yet.

>obviously anything beyond 32G is open for problems when the PM timer
>is in use.

This number wasn't consistent then either - the boundary would rather be around 64G on that system, but obviously lower on others.

Jan
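A rough back-of-the-envelope using the corrected figure: a 64G 1:1 mapping contains 64 GB / 4 KB = 2^24, i.e. about 16.8 million, L1 entries, and if the whole mapping hangs off a single L4 slot (an assumption here), its validation becomes a single non-preemptible step of roughly

$$ \frac{64\ \mathrm{GB}}{4\ \mathrm{GB}} \times \frac{1}{8}\ \mathrm{s} = 2\ \mathrm{s}, $$

already a sizable fraction of the ~4.7 s PM-timer wrap period estimated earlier - and the machine in the original report is evidently slower still.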
On 14/5/08 16:54, "Jan Beulich" <jbeulich@novell.com> wrote:

> I'm afraid that's not the case - after they are now using the domain
> shutdown fix successfully, they upgraded the machine to 64G and
> the system fails to boot. Sounds exactly like other reports we had on
> the list regarding boot failures with lots of memory that can be avoided
> using dom0_mem=<much smaller value>. As I understand it, this is
> due to the way the kernel creates its 1:1 mapping - the hypervisor has
> to validate the whole tree from each L4 entry being installed in a single
> step - for a 4G machine I measured half a second for this operation, so
> obviously anything beyond 32G is open for problems when the PM timer
> is in use.

Hmm, yes that makes sense. 32GB is 8M ptes, so I could imagine that taking a while to validate. Anyhow this obviously needs fixing regardless of the specific details of this specific failure case.

> Unless you tell me that this is on your very short term agenda to work on,
> I'll make an attempt at finding a reasonable solution starting tomorrow.

Yes, I'll sort this one out hopefully by next week. I think this can be solved pretty straightforwardly. It's the encoding of the continuation into the page_info structure, and synchronisation of that, that needs some back-of-envelope thought. As long as there are not too many callers of {get,put}_page_type(L{2,3,4}_pagetable), and I don't think we have that many, then the changes should be pretty localised. Only those callers have to deal with 'EAGAIN' (or equivalent).

 -- Keir