In (3.0.4-based) SLE10 SP1 we are currently dealing with a (reproducible) report of time getting screwed up during domain shutdown. Debugging revealed that the PM timer misses at least one overflow (i.e. platform time lost about 4 seconds), which subsequently leads to disastrous effects.

Apart from tracking the time calibration, as the (currently) last step of narrowing down the cause I made the first processor that detects a severe anomaly in time flow send an IPI to CPU0 (which is exclusively responsible for managing platform time). This appears to prove that CPU0 is indeed busy processing a domain_kill() request, and specifically is in the middle of tearing down the address spaces of the guest.

Obviously, the hypervisor's behavior should not depend on the amount of time needed to free a dead domain's resources. But as the code is written, domain shutdown is executed synchronously on the CPU requesting it (from some code comparison I would conclude that, while the code has changed significantly, this basic characteristic hasn't - of course, history shows that I may easily be overlooking something here), and if that CPU happens to be CPU0, the whole system suffers due to the asymmetry of platform time handling.

If I'm indeed not overlooking an important fix in that area, what would be considered a reasonable solution to this? I can imagine (in order of my preference):

- inserting calls to do_softirq() in the put_page_and_type() call hierarchy (e.g. in alloc_l2_table() or even alloc_l1_table(), to guarantee uniform behavior across sub-architectures; this might help address other issues, as the same scenario might happen when a page table hierarchy gets destroyed at times other than domain shutdown); perhaps the same might then also be needed in the get_page_type() hierarchy, e.g.
in free_l{2,1}_table()

- simply doing round-robin responsibility of platform time among all CPUs (would leave the unlikely UP case as still affected by the problem)

- detecting platform timer overflow (and properly estimating how many times it has overflowed) and sync-ing platform time back from local time (as indicated in a comment somewhere)

- marshalling the whole operation to another CPU

For reference, this is the CPU0 backtrace I'm getting from the IPI:

(XEN) *** Dumping CPU0 host state: ***
(XEN) State at keyhandler.c:109
(XEN) ----[ Xen-3.0.4_13138-0.63 x86_64 debug=n Not tainted ]----
(XEN) CPU: 0
(XEN) RIP: e008:[<ffff83000010e8a2>] dump_execstate+0x62/0xe0
(XEN) RFLAGS: 0000000000010246 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: 0000000000000000 rcx: 000000000013dd62
(XEN) rdx: 000000000000000a rsi: 000000000000000a rdi: ffff8300002b2142
(XEN) rbp: 0000000000000000 rsp: ffff8300001d3a30 r8: 0000000000000001
(XEN) r9: 0000000000000001 r10: 00000000fffffffc r11: 0000000000000001
(XEN) r12: 0000000000000001 r13: 0000000000000001 r14: 0000000000000001
(XEN) r15: cccccccccccccccd cr0: 0000000080050033 cr4: 00000000000006f0
(XEN) cr3: 000000000ce02000 cr2: 00002b47f8871ca8
(XEN) ds: 0000 es: 0000 fs: 0063 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff8300001d3a30:
(XEN)    0000000000000046 ffff830000f7e280 ffff8300002b0e00 ffff830000f7e280
(XEN)    ffff83000013b665 0000000000000000 ffff83000012dc8a cccccccccccccccd
(XEN)    0000000000000001 0000000000000001 0000000000000001 ffff830000f7e280
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    ffff8284008f7aa0 ffff8284008f7ac8 0000000000000000 0000000000000000
(XEN)    0000000000039644 ffff8284008f7aa0 000000fb00000000 ffff83000011345d
(XEN)    000000000000e008 0000000000000246 ffff8300001d3b18 000000000000e010
(XEN)    ffff830000113348 ffff83000013327f 0000000000000000 ffff8284008f7aa0
(XEN)    ffff8307cc1b7288 ffff8307cc1b8000 ffff830000f7e280 00000000007cc315
(XEN)    ffff8284137e4498 ffff830000f7e280 ffff830000132c24 0000000020000001
(XEN)    0000000020000000 ffff8284137e4498 00000000007cc315 ffff8284137e7b48
(XEN)    ffff830000132ec4 ffff8284137e4498 000000000000015d ffff830000f7e280
(XEN)    ffff8300001328d2 ffff8307cc315ae8 ffff830000132cbb 0000000040000001
(XEN)    0000000040000000 ffff8284137e7b48 ffff830000f7e280 ffff8284137f6be8
(XEN)    ffff830000132ec4 ffff8284137e7b48 00000000007cc919 ffff8307cc91a000
(XEN)    ffff8300001331a2 ffff8307cc919018 ffff830000132d41 0000000060000001
(XEN)    0000000060000000 ffff8284137f6be8 0000000000006ea6 ffff8284001149f0
(XEN)    ffff830000132ec4 ffff8284137f6be8 0000000000000110 ffff830000f7e280
(XEN)    ffff830000133132 ffff830006ea6880 ffff830000132df0 0000000080000001
(XEN)    0000000080000000 ffff8284001149f0 ffff8284001149f0 ffff8284001149f0
(XEN) Xen call trace:
(XEN)    [<ffff83000010e8a2>] dump_execstate+0x62/0xe0
(XEN)    [<ffff83000013b665>] smp_call_function_interrupt+0x55/0xa0
(XEN)    [<ffff83000012dc8a>] call_function_interrupt+0x2a/0x30
(XEN)    [<ffff83000011345d>] free_domheap_pages+0x2bd/0x3b0
(XEN)    [<ffff830000113348>] free_domheap_pages+0x1a8/0x3b0
(XEN)    [<ffff83000013327f>] put_page_from_l1e+0x9f/0x120
(XEN)    [<ffff830000132c24>] free_page_type+0x314/0x540
(XEN)    [<ffff830000132ec4>] put_page_type+0x74/0xf0
(XEN)    [<ffff8300001328d2>] put_page_from_l2e+0x32/0x70
(XEN)    [<ffff830000132cbb>] free_page_type+0x3ab/0x540
(XEN)    [<ffff830000132ec4>] put_page_type+0x74/0xf0
(XEN)    [<ffff8300001331a2>] put_page_from_l3e+0x32/0x70
(XEN)    [<ffff830000132d41>] free_page_type+0x431/0x540
(XEN)    [<ffff830000132ec4>] put_page_type+0x74/0xf0
(XEN)    [<ffff830000133132>] put_page_from_l4e+0x32/0x70
(XEN)    [<ffff830000132df0>] free_page_type+0x4e0/0x540
(XEN)    [<ffff830000132ec4>] put_page_type+0x74/0xf0
(XEN)    [<ffff83000012923a>] relinquish_memory+0x17a/0x290
(XEN)    [<ffff830000183665>] identify_cpu+0x5/0x1f0
(XEN)    [<ffff830000117f10>] vcpu_runstate_get+0xb0/0xf0
(XEN)    [<ffff8300001296aa>] domain_relinquish_resources+0x35a/0x3b0
(XEN)    [<ffff8300001083e8>] domain_kill+0x28/0x60
(XEN)    [<ffff830000107560>] do_domctl+0x690/0xe60
(XEN)    [<ffff830000121def>] __putstr+0x1f/0x70
(XEN)    [<ffff830000138016>] mod_l1_entry+0x636/0x670
(XEN)    [<ffff830000118143>] schedule+0x1f3/0x270
(XEN)    [<ffff830000175ca6>] toggle_guest_mode+0x126/0x140
(XEN)    [<ffff830000175fa8>] do_iret+0xa8/0x1c0
(XEN)    [<ffff830000173b32>] syscall_enter+0x62/0x67

Jan
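As an aside, the "about 4 seconds" of lost platform time is consistent with a single missed wrap of the ACPI PM timer, assuming the usual 24-bit counter running at 3.579545 MHz:

$$ t_{\mathrm{wrap}} = \frac{2^{24}\ \text{ticks}}{3\,579\,545\ \text{ticks/s}} \approx 4.69\ \text{s} $$

So any code path that keeps CPU0 from sampling the counter for longer than roughly 4.7 seconds can silently drop a full wrap's worth of platform time.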
This was addressed by xen-unstable:15821. The fix is present in releases since 3.2.0. It was never backported to 3.1 branch.

There are a few changesets related to 15821 that you would also want to take into your tree. For example, 15838 is a bugfix. And there is also a change on the tools side that is required because domain_destroy can now return -EAGAIN if it gets preempted. Any others will probably become obvious when you try to backport 15821.

 -- Keir

On 28/4/08 14:45, "Jan Beulich" <jbeulich@novell.com> wrote:

> In (3.0.4-based) SLE10 SP1 we are currently dealing with a (reproducible)
> report of time getting screwed up during domain shutdown. [...]
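Keir's remark about the tools side implies a retry loop around the destroy operation. The sketch below only illustrates that pattern - the wrapper name is invented, and whether the loop belongs in libxc or in its caller (and the exact error convention of xc_domain_destroy()) is an assumption here, not the content of the actual changesets. The hypervisor-side counterpart is the hypercall_preempt_check() in relinquish_memory() that Jan identifies in the next message.

```c
/*
 * Sketch only: retry a domain-destroy request that may now be preempted
 * and report EAGAIN.  destroy_domain_retrying() is a hypothetical helper;
 * the real tools-side change may look quite different.
 */
#include <errno.h>
#include <stdint.h>
#include <xenctrl.h>

int destroy_domain_retrying(int xc_handle, uint32_t domid)
{
    int rc;

    do {
        rc = xc_domain_destroy(xc_handle, domid);
    } while ( rc < 0 && errno == EAGAIN );  /* preempted - just try again */

    return rc;
}
```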
>>> Keir Fraser <keir.fraser@eu.citrix.com> 28.04.08 15:59 >>>
>This was addressed by xen-unstable:15821. The fix is present in releases
>since 3.2.0. It was never backported to 3.1 branch.
>
>There are a few changesets related to 15821 that you would also want to take
>into your tree. For example, 15838 is a bugfix. And there is also a change
>on the tools side that is required because domain_destroy can now return
>-EAGAIN if it gets preempted. Any others will probably become obvious when
>you try to backport 15821.
>
> -- Keir

Okay, thanks - so I indeed missed the call to hypercall_preempt_check() in relinquish_memory(), which is the key indicator here.

However, that change deals exclusively with domain shutdown, but not with the more general page table pinning/unpinning operations, which I believe are (as described) vulnerable to mis-use by a malicious guest (I realize that well behaved guests would not normally present a heavily populated address space here, but it also cannot be entirely excluded) - the upper bound to the number of operations on x86-64 is 512**4 or 2**36 l1 table entries (ignoring the hypervisor hole which doesn't need processing).

Jan
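To put that bound in perspective:

$$ 512^4 = \left(2^9\right)^4 = 2^{36} \approx 6.9\times10^{10}\ \text{L1 entries} $$

Even at an assumed cost of only one nanosecond per entry, a single fully populated pin/unpin operation could in the theoretical worst case keep a CPU busy for roughly $2^{36}\,\mathrm{ns} \approx 69$ seconds without a preemption point.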
On 28/4/08 15:30, "Jan Beulich" <jbeulich@novell.com> wrote:

> However, that change deals exclusively with domain shutdown, but not
> with the more general page table pinning/unpinning operations, which I
> believe are (as described) vulnerable to mis-use by a malicious guest (I
> realize that well behaved guests would not normally present a heavily
> populated address space here, but it also cannot be entirely excluded)
> - the upper bound to the number of operations on x86-64 is 512**4
> or 2**36 l1 table entries (ignoring the hypervisor hole which doesn't
> need processing).

True. It turns out to be good enough in practice though.

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 30.04.08 16:26 >>>
>On 30/4/08 15:00, "Jan Beulich" <jbeulich@novell.com> wrote:
>>
>> According to two forced backtraces with about a second delta, the
>> hypervisor is in the process of releasing the 1:1 mapping of the
>> guest kernel and managed, during that one second, to increment
>> i in free_l3_table() by just 1. This would make up for unbelievable
>> 13,600 clocks per l1 entry being freed.
>
>That's not great. :-) At such a high cost, perhaps some tracing might
>indicate if we are taking some stupid slow path in free_domheap_page() or
>cleanup_page_cacheattr()? I very much hope that 13600 cycles cannot be
>legitimately accounted for!

I'm afraid it's really that bad. I used another (local to my office) machine, and the numbers aren't exactly as bad as on the box they were originally measured on. But after collecting the cumulative clock cycles spent in free_l1_table() and free_domheap_pages() (and their descendants, so the former obviously includes a large part of the latter) during the largest single run of relinquish_memory(), I'm getting an average of 3,400 clocks spent in free_domheap_pages() (with all but very few pages going onto the scrub list) and 8,500 clocks spent per page table entry (assuming all entries are populated, so the real number is higher) in free_l1_table(). It's the relationship between the two numbers that makes me believe that there's really this much time spent on it.

For the specific case of cleaning up after a domain, there seems to be a pretty simple workaround, though: free_l{3,4}_table() can simply avoid recursing into put_page_from_l{3,4}e() by checking for d->arch.relmem being RELMEM_dom_l{3,4}. This, as expected, reduces the latency of preempting relinquish_memory() (for a 5G domU) on the box I tested from about 3s to less than half a second - if that's considered still too much, the same kind of check could of course be added to free_l2_table().

But as there's no similarly simple mechanism to deal with the DoS potential in pinning/unpinning or installing L4 (and maybe L3) table entries, there'll need to be a way to preempt these call trees anyway. Since hypercalls cannot nest, storing respective state in the vcpu structure shouldn't be a problem, but what I'm unsure about is what side effects a partially validated page table might introduce.

While looking at this I wondered whether there really is a way for Xen heap pages to end up being guest page tables (or similarly descriptor table ones)? I would think if that happened this would be a bug (and perhaps a security issue). If it cannot happen, then the RELMEM_* states could be simplified and domain_relinquish_resources() shortened.

(I was traveling, so it took a while to get to do the measurements.)

Jan
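Jan doesn't show how these cumulative cycle counts were collected; the sketch below illustrates one plausible way to gather such numbers with the TSC. All names here are hypothetical - this is not the instrumentation actually used:

```c
/*
 * Hypothetical sketch of TSC-based cycle accounting of the kind that could
 * produce cumulative per-function totals like those quoted above.
 * Not actual Xen code; names are invented for illustration.
 */
#include <stdint.h>

static inline uint64_t tsc_read(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ( "rdtsc" : "=a" (lo), "=d" (hi) );
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t free_l1_cycles;   /* dumped later, e.g. via a debug key */

static void instrumented_free_l1_table(void *page)
{
    uint64_t t0 = tsc_read();

    (void)page;                   /* ... body of free_l1_table() here ... */

    free_l1_cycles += tsc_read() - t0;
}
```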
On 8/5/08 10:58, "Jan Beulich" <jbeulich@novell.com> wrote:

> While looking at this I wondered whether there really is a way for
> Xen heap pages to end up being guest page tables (or similarly
> descriptor table ones)? I would think if that happened this would be
> a bug (and perhaps a security issue). If it cannot happen, then the
> RELMEM_* states could be simplified and
> domain_relinquish_resources() shortened.

You mean just force page-table type counts to zero, and drop main reference count by the same amount? Might work. Would need some thought.

8500 cycles per pte is pretty fantastic. I suppose a few atomic ops are involved. Are you running on an old P4? :-)

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 08.05.08 12:12 >>>
>On 8/5/08 10:58, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> While looking at this I wondered whether there really is a way for
>> Xen heap pages to end up being guest page tables (or similarly
>> descriptor table ones)? I would think if that happened this would be
>> a bug (and perhaps a security issue). If it cannot happen, then the
>> RELMEM_* states could be simplified and
>> domain_relinquish_resources() shortened.
>
>You mean just force page-table type counts to zero, and drop main reference
>count by the same amount? Might work. Would need some thought.

No, here I mean having just RELMEM_xen and RELMEM_l{1,2,3,4}. Then simply release Xen pages first, then l4...l1.

For the suggested workaround to reduce latency of relinquish_memory() preemption, I simply mean utilizing the code to deal with circular references also for releasing simple ones (that code path doesn't seem to care to force the type count to zero, but as I understand that's no problem since these pages end up being freed anyway, and that's where the whole type_info field gets re-initialized - or was this happening when the page gets allocated the next time).

>8500 cycles per pte is pretty fantastic. I suppose a few atomic ops are
>involved. Are you running on an old P4? :-)

Not too old, it's what they called Tulsa as codename (i.e. some of the about two year old Xeons). But I suppose that generally the bigger the box (in terms of number of threads/cores/sockets), the higher the price for atomic ops.

In trying to get a picture, I e.g. measured both the cumulative full free_domheap_pages()'s and free_l1_table()'s contributions as well as just the d != NULL sub-part of free_domheap_pages() - example results are

0x2a4990400 clocks for full free_l1_table()
0x10f1a3749 clocks for full free_domheap_pages()
0x0ec6748eb clocks for the d != NULL body

Given how little it is that happens outside of that d != NULL body, I'm concluding that the atomic ops are by far not alone responsible for the long execution time.

These are dual-threaded CPUs, however, so even though the system was doing nothing else I cannot exclude that the CPUs do something dumb when switching between the threads. But excluding this possible effect from the picture seems to have little sense, since we need to be able to deal with the situation anyway.

Jan
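For readability, the hex totals correspond roughly to:

$$
\begin{aligned}
\texttt{0x2a4990400} &\approx 1.14\times10^{10}\ \text{cycles (free\_l1\_table(), total)}\\
\texttt{0x10f1a3749} &\approx 4.55\times10^{9}\ \text{cycles (free\_domheap\_pages(), total)}\\
\texttt{0x0ec6748eb} &\approx 3.97\times10^{9}\ \text{cycles (the \texttt{d != NULL} body)}
\end{aligned}
$$

That is, free_domheap_pages() accounts for roughly 40% of the time spent under free_l1_table() - consistent with the 3,400 vs. 8,500 clock averages above - and roughly 87% of its time is inside the d != NULL body, supporting the conclusion that the atomic ops alone cannot explain the cost.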
On 8/5/08 11:39, "Jan Beulich" <jbeulich@novell.com> wrote:

> No, here I mean having just RELMEM_xen and RELMEM_l{1,2,3,4}.
> Then simply release Xen pages first, then l4...l1.
>
> For the suggested workaround to reduce latency of relinquish_memory()
> preemption, I simply mean utilizing the code to deal with circular
> references also for releasing simple ones (that code path doesn't seem
> to care to force the type count to zero, but as I understand that's no
> problem since these pages end up being freed anyway, and that's
> where the whole type_info field gets re-initialized - or was this
> happening when the page gets allocated the next time).

You've lost me. Either you are confused or I have forgotten the details of how that shutdown code works. Either is quite possible I suspect. :-) Basically I don't see how this avoids the recursive, and potentially rather expensive, teardown. Nor am I convinced about how much potential time-saving there is to be had here.

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 08.05.08 12:52 >>>
>On 8/5/08 11:39, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> No, here I mean having just RELMEM_xen and RELMEM_l{1,2,3,4}.
>> Then simply release Xen pages first, then l4...l1.
>>
>> For the suggested workaround to reduce latency of relinquish_memory()
>> preemption, I simply mean utilizing the code to deal with circular
>> references also for releasing simple ones (that code path doesn't seem
>> to care to force the type count to zero, but as I understand that's no
>> problem since these pages end up being freed anyway, and that's
>> where the whole type_info field gets re-initialized - or was this
>> happening when the page gets allocated the next time).
>
>You've lost me. Either you are confused or I have forgotten the details of
>how that shutdown code works. Either is quite possible I suspect. :-)
>Basically I don't see how this avoids the recursive, and potentially rather
>expensive, teardown.

All I mean is a change like this:

--- 2008-05-08.orig/xen/arch/x86/mm.c
+++ 2008-05-08/xen/arch/x86/mm.c
@@ -1341,6 +1341,9 @@ static void free_l3_table(struct page_in
     l3_pgentry_t *pl3e;
     int i;
 
+    if(d->arch.relmem == RELMEM_dom_l3)
+        return;
+
     pl3e = map_domain_page(pfn);
 
     for ( i = 0; i < L3_PAGETABLE_ENTRIES; i++ )
@@ -1364,6 +1367,9 @@ static void free_l4_table(struct page_in
     l4_pgentry_t *pl4e = page_to_virt(page);
     int i;
 
+    if(d->arch.relmem == RELMEM_dom_l4)
+        return;
+
     for ( i = 0; i < L4_PAGETABLE_ENTRIES; i++ )
         if ( is_guest_l4_slot(d, i) )
             put_page_from_l4e(pl4e[i], pfn);

I tried it out on SLE10 SP1 (3.0.4 derived), and it appeared to work and serve the purpose. With this, L3 and L2 tables are no longer freed recursively upon an L4/L3 one dropping its last type reference, but they rather get caught by the code that so far was only responsible for dealing with circular references. The result is that between individual full L2 tables (including the L1s hanging off of them) being released there now is a preemption check. Up until now, when the last L4 table got freed, everything hanging off of it needed to be dealt with in a single non-preemptible chunk.

>Nor am I convinced about how much potential time-saving
>there is to be had here.

I'm not seeing any time saving here. The other thing I brought up was just an unrelated item pointing out potential for code simplification.

Jan
On 8/5/08 12:13, "Jan Beulich" <jbeulich@novell.com> wrote:

>>> Nor am I convinced about how much potential time-saving
>>> there is to be had here.
>>
>> I'm not seeing any time saving here. The other thing I brought up
>> was just an unrelated item pointing out potential for code
>> simplification.

Ah, yes, I see.

The approach looks plausible. I think in its current form it will leave zombie L2/L3 pages hanging around and the domain will never actually properly die (e.g., still will be visible with the 'q' key). Because although you do get around to doing free_lX_table(), the type count and ref count of the L2/L3 pages will not drop to zero because the dead L3/L4 page never actually dropped its references properly.

In actuality, since we know that we never have 'cross-domain' pagetable type references, we should actually be able to zap pagetable reference counts to zero. The only reason we don't do that right now is really because it provides good debugging info to see whether a domain's refcounts have got screwed up. But that would not prevent us doing something faster for NDEBUG builds, at least.

Does that make sense?

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 08.05.08 14:11 >>>
>The approach looks plausible. I think in its current form it will leave
>zombie L2/L3 pages hanging around and the domain will never actually
>properly die (e.g., still will be visible with the 'q' key). Because
>although you do get around to doing free_lX_table(), the type count and ref
>count of the L2/L3 pages will not drop to zero because the dead L3/L4 page
>never actually dropped its references properly.

Hmm, indeed, I should look for this after the next run.

>In actuality, since we know that we never have 'cross-domain' pagetable type
>references, we should actually be able to zap pagetable reference counts to
>zero. The only reason we don't do that right now is really because it
>provides good debugging info to see whether a domain's refcounts have got
>screwed up. But that would not prevent us doing something faster for NDEBUG
>builds, at least.
>
>Does that make sense?

Yes, except for me not immediately seeing why this is then not also a problem for the current circular reference handling.

But really, rather than introducing (and fixing) the hack here I'd much prefer a generic solution to the problem, and you didn't say a word on the thoughts I had on that (but in a mail a couple of days ago you indicated you might get around doing something in that area yourself, so I half way implied you may have a mechanism in mind already).

Jan
On 8/5/08 13:33, "Jan Beulich" <jbeulich@novell.com> wrote:

>> In actuality, since we know that we never have 'cross-domain' pagetable type
>> references, we should actually be able to zap pagetable reference counts to
>> zero. The only reason we don't do that right now is really because it
>> provides good debugging info to see whether a domain's refcounts have got
>> screwed up. But that would not prevent us doing something faster for NDEBUG
>> builds, at least.
>>
>> Does that make sense?
>
> Yes, except for me not immediately seeing why this is then not also a
> problem for the current circular reference handling.

Because ultimately the reference(s) that are still being held on the page we are unvalidating and calling free_lX_table() on will get dropped, due to the fact we are breaking the circular chain and calling free_lX_table()->put_page_and_type()->...

> But really, rather than introducing (and fixing) the hack here I'd much
> prefer a generic solution to the problem, and you didn't say a word on
> the thoughts I had on that (but in a mail a couple of days ago you
> indicated you might get around doing something in that area yourself,
> so I half way implied you may have a mechanism in mind already).

I don't have a very clear plan, except that some kind of continuation (basically encoding of how far we got) must be encoded in the page_info structure. We should be able to find spare bits for a page which is in this in-between state.

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 08.05.08 14:36 >>>
>> But really, rather than introducing (and fixing) the hack here I'd much
>> prefer a generic solution to the problem, and you didn't say a word on
>> the thoughts I had on that (but in a mail a couple of days ago you
>> indicated you might get around doing something in that area yourself,
>> so I half way implied you may have a mechanism in mind already).
>
>I don't have a very clear plan, except that some kind of continuation
>(basically encoding of how far we got) must be encoded in the page_info
>structure. We should be able to find spare bits for a page which is in this
>in-between state.

Hmm, storing this in page_info seems questionable to me. It'd be at least 18 bits (on x86-64) that we'd need. I think this rather has to go into struct vcpu.

But what worries me more is that (obviously) any affected page will have to have its PGT_validated bit kept clear, which could lead to undesirable latencies in spin loops on other vcpus waiting for it to become set. In the worst case this could lead to deadlocks (at least in the UP case or when multiple vCPU-s of one guest are pinned to the same physical CPU) afaics. Perhaps this part could indeed be addressed with a new PGT_* bit, upon which waiters could exit their spin loops and consider themselves preempted.

Jan
On 8/5/08 15:29, "Jan Beulich" <jbeulich@novell.com> wrote:

> Hmm, storing this in page_info seems questionable to me. It'd be at
> least 18 bits (on x86-64) that we'd need. I think this rather has to go
> into struct vcpu.

We can, for example, reuse tlbflush_timestamp for this purpose. Stick it in the vcpu structure and I think we make life hard for ourselves. What if the guest does not resume the hypercall, for example? What if the guest goes and tries to execute a different hypercall instead?

> But what worries me more is that (obviously) any affected page will
> have to have its PGT_validated bit kept clear, which could lead to
> undesirable latencies in spin loops on other vcpus waiting for it to
> become set. In the worst case this could lead to deadlocks (at least
> in the UP case or when multiple vCPU-s of one guest are pinned to
> the same physical CPU) afaics. Perhaps this part could indeed be
> addressed with a new PGT_* bit, upon which waiters could exit
> their spin loops and consider themselves preempted.

Yes, the page state machine does need some more careful thought. I'm pretty sure we have enough page state bits though.

 -- Keir
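As a purely hypothetical illustration of the idea being discussed - recording how far validation or teardown of a page table got, in space borrowed from struct page_info, guarded by a not-yet-existing "partial" type flag - consider the sketch below. All names, types and bit positions are invented, and the eventual design may differ entirely:

```c
/*
 * Hypothetical sketch of encoding a validation/teardown continuation in
 * per-page state, as discussed above.  The structure, flag and helpers do
 * not match real Xen definitions; they only illustrate the idea.
 */
#include <stdint.h>

#define PGT_partial_sketch  (1u << 27)   /* invented "in-between state" flag */

struct page_info_sketch {
    uint32_t type_info;                  /* type, validation flags, count    */
    uint32_t tlbflush_timestamp;         /* reusable while partially valid   */
};

/* Record that entries [0, idx) of this page table have been processed. */
static void save_progress(struct page_info_sketch *pg, unsigned int idx)
{
    pg->tlbflush_timestamp = idx;        /* 9 bits suffice for 512 entries */
    pg->type_info |= PGT_partial_sketch;
}

/* Entry index to resume from; 0 if the page was never left half-done. */
static unsigned int resume_index(const struct page_info_sketch *pg)
{
    return (pg->type_info & PGT_partial_sketch) ? pg->tlbflush_timestamp : 0;
}
```

Reusing tlbflush_timestamp this way would only be valid while the page sits in the in-between state, which is presumably why Keir singles that field out.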
>>> Keir Fraser <keir.fraser@eu.citrix.com> 08.05.08 14:11 >>>
>On 8/5/08 12:13, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>>> Nor am I convinced about how much potential time-saving
>>> there is to be had here.
>>
>> I'm not seeing any time saving here. The other thing I brought up
>> was just an unrelated item pointing out potential for code
>> simplification.
>
>Ah, yes, I see.
>
>The approach looks plausible. I think in its current form it will leave
>zombie L2/L3 pages hanging around and the domain will never actually
>properly die (e.g., still will be visible with the 'q' key). Because
>although you do get around to doing free_lX_table(), the type count and ref
>count of the L2/L3 pages will not drop to zero because the dead L3/L4 page
>never actually dropped its references properly.

Indeed, the extended version below avoids this.

>In actuality, since we know that we never have 'cross-domain' pagetable type
>references, we should actually be able to zap pagetable reference counts to
>zero. The only reason we don't do that right now is really because it
>provides good debugging info to see whether a domain's refcounts have got
>screwed up. But that would not prevent us doing something faster for NDEBUG
>builds, at least.

I still thought it'd be better to not simply zap the counts, but incrementally drop them using the proper interface:

Index: 2008-05-08/xen/arch/x86/domain.c
===================================================================
--- 2008-05-08.orig/xen/arch/x86/domain.c	2008-05-07 12:21:36.000000000 +0200
+++ 2008-05-08/xen/arch/x86/domain.c	2008-05-09 12:05:18.000000000 +0200
@@ -1725,6 +1725,23 @@ static int relinquish_memory(
         if ( test_and_clear_bit(_PGC_allocated, &page->count_info) )
             put_page(page);
 
+        y = page->u.inuse.type_info;
+
+        /*
+         * Forcibly drop reference counts of page tables above top most (which
+         * were skipped to prevent long latencies due to deep recursion - see
+         * the special treatment in free_lX_table()).
+         */
+        if ( type < PGT_root_page_table &&
+             unlikely(((y + PGT_type_mask) &
+                       (PGT_type_mask|PGT_validated)) == type) ) {
+            BUG_ON((y & PGT_count_mask) >= (page->count_info & PGC_count_mask));
+            while ( y & PGT_count_mask ) {
+                put_page_and_type(page);
+                y = page->u.inuse.type_info;
+            }
+        }
+
         /*
          * Forcibly invalidate top-most, still valid page tables at this point
          * to break circular 'linear page table' references. This is okay
@@ -1732,7 +1749,6 @@
          * is now dead. Thus top-most valid tables are not in use so a non-zero
          * count means circular reference.
          */
-        y = page->u.inuse.type_info;
         for ( ; ; )
         {
             x = y;
@@ -1896,6 +1912,9 @@ int domain_relinquish_resources(struct d
            /* fallthrough */

        case RELMEM_done:
+            ret = relinquish_memory(d, &d->page_list, PGT_l1_page_table);
+            if ( ret )
+                return ret;
            break;

        default:
Index: 2008-05-08/xen/arch/x86/mm.c
===================================================================
--- 2008-05-08.orig/xen/arch/x86/mm.c	2008-05-08 12:13:40.000000000 +0200
+++ 2008-05-08/xen/arch/x86/mm.c	2008-05-08 13:04:13.000000000 +0200
@@ -1341,6 +1341,9 @@ static void free_l3_table(struct page_in
     l3_pgentry_t *pl3e;
     int i;
 
+    if(d->arch.relmem == RELMEM_dom_l3)
+        return;
+
     pl3e = map_domain_page(pfn);
 
     for ( i = 0; i < L3_PAGETABLE_ENTRIES; i++ )
@@ -1364,6 +1367,9 @@ static void free_l4_table(struct page_in
     l4_pgentry_t *pl4e = page_to_virt(page);
     int i;
 
+    if(d->arch.relmem == RELMEM_dom_l4)
+        return;
+
     for ( i = 0; i < L4_PAGETABLE_ENTRIES; i++ )
         if ( is_guest_l4_slot(d, i) )
             put_page_from_l4e(pl4e[i], pfn);
On 9/5/08 11:23, "Jan Beulich" <jbeulich@novell.com> wrote:

> Indeed, the extended version below avoids this.
>
>> In actuality, since we know that we never have 'cross-domain' pagetable type
>> references, we should actually be able to zap pagetable reference counts to
>> zero. The only reason we don't do that right now is really because it
>> provides good debugging info to see whether a domain's refcounts have got
>> screwed up. But that would not prevent us doing something faster for NDEBUG
>> builds, at least.
>
> I still thought it'd be better to not simply zap the counts, but
> incrementally drop them using the proper interface:

Theoretically you can still race PIN_Lx_TABLE hypercalls from other dom0 VCPUs. Obviously that would only happen from a misbehaving dom0 though. I think this patch is a reasonable stopgap measure.

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 28.04.08 16:42 >>>
>On 28/4/08 15:30, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> However, that change deals exclusively with domain shutdown, but not
>> with the more general page table pinning/unpinning operations, which I
>> believe are (as described) vulnerable to mis-use by a malicious guest (I
>> realize that well behaved guests would not normally present a heavily
>> populated address space here, but it also cannot be entirely excluded)
>> - the upper bound to the number of operations on x86-64 is 512**4
>> or 2**36 l1 table entries (ignoring the hypervisor hole which doesn't
>> need processing).
>
>True. It turns out to be good enough in practice though.

I'm afraid that's not the case - after they are now using the domain shutdown fix successfully, they upgraded the machine to 64G and the system fails to boot. Sounds exactly like other reports we had on the list regarding boot failures with lots of memory that can be avoided using dom0_mem=<much smaller value>. As I understand it, this is due to the way the kernel creates its 1:1 mapping - the hypervisor has to validate the whole tree from each L4 entry being installed in a single step - for a 4G machine I measured half a second for this operation, so obviously anything beyond 32G is open for problems when the PM timer is in use.

Unless you tell me that this is on your very short term agenda to work on, I'll make an attempt at finding a reasonable solution starting tomorrow.

Jan
>>> "Jan Beulich" <jbeulich@novell.com> 14.05.08 17:54 >>>
>I'm afraid that's not the case - after they are now using the domain
>shutdown fix successfully, they upgraded the machine to 64G and
>the system fails to boot. Sounds exactly like other reports we had on
>the list regarding boot failures with lots of memory that can be avoided
>using dom0_mem=<much smaller value>. As I understand it, this is
>due to the way the kernel creates its 1:1 mapping - the hypervisor has
>to validate the whole tree from each L4 entry being installed in a single
>step - for a 4G machine I measured half a second for this operation, so

Sorry, I meant to write 1/8th of a second. But that's on a small (and hence memory-wise faster) machine. Didn't measure on my bigger box, yet.

>obviously anything beyond 32G is open for problems when the PM timer
>is in use.

This number wasn't consistent then either - the boundary would rather be around 64G on that system, but obviously lower on others.

Jan
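A rough back-of-the-envelope using the corrected figure: a 64G 1:1 mapping contains 64 GB / 4 KB = 2^24, i.e. about 16.8 million, L1 entries, and if the whole mapping hangs off a single L4 slot (an assumption here), its validation becomes a single non-preemptible step of roughly

$$ \frac{64\ \mathrm{GB}}{4\ \mathrm{GB}} \times \frac{1}{8}\ \mathrm{s} = 2\ \mathrm{s}, $$

already a sizable fraction of the ~4.7 s PM-timer wrap period estimated earlier - and the machine in the original report is evidently slower still.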
On 14/5/08 16:54, "Jan Beulich" <jbeulich@novell.com> wrote:

> I'm afraid that's not the case - after they are now using the domain
> shutdown fix successfully, they upgraded the machine to 64G and
> the system fails to boot. Sounds exactly like other reports we had on
> the list regarding boot failures with lots of memory that can be avoided
> using dom0_mem=<much smaller value>. As I understand it, this is
> due to the way the kernel creates its 1:1 mapping - the hypervisor has
> to validate the whole tree from each L4 entry being installed in a single
> step - for a 4G machine I measured half a second for this operation, so
> obviously anything beyond 32G is open for problems when the PM timer
> is in use.

Hmm, yes that makes sense. 32GB is 8M ptes, so I could imagine that taking a while to validate. Anyhow this obviously needs fixing regardless of the specific details of this specific failure case.

> Unless you tell me that this is on your very short term agenda to work on,
> I'll make an attempt at finding a reasonable solution starting tomorrow.

Yes, I'll sort this one out hopefully by next week. I think this can be solved pretty straightforwardly. It's the encoding of the continuation into the page_info structure, and synchronisation of that, that needs some back-of-envelope thought. As long as there are not too many callers of {get,put}_page_type(L{2,3,4}_pagetable), and I don't think we have that many, then the changes should be pretty localised. Only those callers have to deal with 'EAGAIN' (or equivalent).

 -- Keir