Chris Lalancette
2008-Oct-15 11:03 UTC
[Xen-devel] [PATCH]: Fix Xen domU boot with batched mprotect
Recent i686 2.6.27 kernels with a certain amount of memory (between 736 and 855MB) have a problem booting under a hypervisor that supports batched mprotect (this includes the RHEL-5 Xen hypervisor as well as any 3.3 or later Xen hypervisor). The problem ends up being that xen_ptep_modify_prot_commit() is using virt_to_machine to calculate which pfn to update. However, this only works for pages that are in the p2m list, and the pages coming from change_pte_range() in mm/mprotect.c are kmap_atomic pages. Because of this, we can run into the situation where the lookup in the p2m table returns an INVALID_MFN, which we then try to pass to the hypervisor, which then (correctly) denies the request to a totally bogus pfn. The right thing to do is to use arbitrary_virt_to_machine, so that we can be sure we are modifying the right pfn. This unfortunately introduces a performance penalty because of a full page-table-walk, but we can avoid that penalty for pages in the p2m list by checking if virt_addr_valid is true, and if so, just doing the lookup in the p2m table. The attached patch implements this, and allows my 2.6.27 i686 based guest with 768MB of memory to boot on a RHEL-5 hypervisor again. Thanks to Jeremy for the suggestions about how to fix this particular issue. Signed-off-by: Chris Lalancette <clalance@redhat.com> _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Oct-15 15:24 UTC
Re: [Xen-devel] [PATCH]: Fix Xen domU boot with batched mprotect
>>> Chris Lalancette <clalance@redhat.com> 15.10.08 13:03 >>> >The right thing to do is to use arbitrary_virt_to_machine, so that we can be >sure we are modifying the right pfn. This unfortunately introduces a >performance penalty because of a full page-table-walk, but we can avoid that >penalty for pages in the p2m list by checking if virt_addr_valid is true, and if >so, just doing the lookup in the p2m table.Could you explain how virt_addr_valid() can validly be used here? Looking at its implementation #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) a kaddr in kmap space (i.e. above high_memory) would return a bogus physical address, and hence pfn_valid() on the resulting pfn is meaningless. I''d instead simply compare the address in question against high_memory, and perhaps instead of in arbitrary_virt_to_machine() in ptep_modify_prot_commit() under an #ifdef CONFIG_HIGHPTE. But performance-wise, CONFIG_HIGHPTE sucks under Xen anyway, so you''d better not turn this on in the first place. We may want/need to provide a means to disable this at run time so the same kernel when run natively could still make use of it, but without impacting performance under Xen. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2008-Oct-15 16:23 UTC
Re: [Xen-devel] [PATCH]: Fix Xen domU boot with batched mprotect
Jan Beulich wrote:>>>> Chris Lalancette <clalance@redhat.com> 15.10.08 13:03 >>> >>>> >> The right thing to do is to use arbitrary_virt_to_machine, so that we can be >> sure we are modifying the right pfn. This unfortunately introduces a >> performance penalty because of a full page-table-walk, but we can avoid that >> penalty for pages in the p2m list by checking if virt_addr_valid is true, and if >> so, just doing the lookup in the p2m table. >> > > Could you explain how virt_addr_valid() can validly be used here? Looking > at its implementation > > #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) > > a kaddr in kmap space (i.e. above high_memory) would return a bogus > physical address, and hence pfn_valid() on the resulting pfn is meaningless. >virt_addr_valid() is supposed to be usable in this circumstace. The comment says "virt_to_page(kaddr) returns a valid pointer if and only if virt_addr_valid(kaddr) returns true", which implies that virt_addr_valid() returns a meaningful result on all addresses - and if not, it should be fixed.> I''d instead simply compare the address in question against high_memory, > and perhaps instead of in arbitrary_virt_to_machine() in > ptep_modify_prot_commit() under an #ifdef CONFIG_HIGHPTE.I suppose, but I don''t think there''s much cost in making it generally robust.> But > performance-wise, CONFIG_HIGHPTE sucks under Xen anyway, so you''d > better not turn this on in the first place. We may want/need to provide > a means to disable this at run time so the same kernel when run natively > could still make use of it, but without impacting performance under Xen. >That''s a secondary issue. What''s the source of the performance hit? Just all the extra kmap_atomic operations? J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Oct-16 07:28 UTC
Re: [Xen-devel] [PATCH]: Fix Xen domU boot with batched mprotect
>>> Jeremy Fitzhardinge <jeremy@goop.org> 15.10.08 18:23 >>> >> But >> performance-wise, CONFIG_HIGHPTE sucks under Xen anyway, so you''d >> better not turn this on in the first place. We may want/need to provide >> a means to disable this at run time so the same kernel when run natively >> could still make use of it, but without impacting performance under Xen. >> > >That''s a secondary issue. What''s the source of the performance hit? >Just all the extra kmap_atomic operations?Yes, afaict. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Oct-16 09:58 UTC
Re: [Xen-devel] [PATCH]: Fix Xen domU boot with batched mprotect
>>> Jeremy Fitzhardinge <jeremy@goop.org> 15.10.08 18:23 >>> >virt_addr_valid() is supposed to be usable in this circumstace. The >comment says "virt_to_page(kaddr) returns a valid pointer if and only if >virt_addr_valid(kaddr) returns true", which implies that >virt_addr_valid() returns a meaningful result on all addresses - and if >not, it should be fixed.Where did you find this comment? I had no luck grep-ing for it... In any case, if that''s the expectation, then on i386 virt_addr_valid() must be implemented as something like #define virt_addr_valid(kaddr) ((kaddr) >= PAGE_OFFSET && (kaddr) < high_memory && pfn_valid(__pa(kaddr) >> PAGE_SHIFT)) x86-64 would need something similar, except that high_memory obviously must be replaced (or that part could perhaps be left out altogether), and the un-mapped addresses above the kernel mapping would need to be filtered out. Btw., if you look at other architectures, you''ll see that most of them use the same (as you say broken) construct. Otoh, if that cited statement really holds, then virt_addr_valid() isn''t really expected to do what its name implies: In particular, there are valid address ranges in kernel space which it wouldn''t be permitted to return true on without significantly complicating the virt_to_page() implementation (e.g. x86-64''s vmalloc and modules areas). As to the original issue - as long as domains are given enough memory (i.e. the range of valid pfn-s is large enough), with the current state of virt_addr_valid() the patch presented ought to not help at all: This code static inline void _check_virt(const char*msg, void *v) { unsigned long pfn = __pa(v) >> PAGE_SHIFT; printk("%s: %p %08lx %d:%d\n", msg, v, pfn, pfn_valid(pfn), virt_addr_valid(v)); } static int __init check_virt(void) { _check_virt("null", NULL); _check_virt("half", (void *)LONG_MAX); _check_virt("hm-p", high_memory - PAGE_SIZE); _check_virt("hm-1", high_memory - 1); _check_virt("hm", high_memory); _check_virt("hm+1", high_memory + 1); _check_virt("hm+p", high_memory + PAGE_SIZE); _check_virt("km", (void *)__fix_to_virt(FIX_KMAP_BEGIN)); _check_virt("hv", (void *)HYPERVISOR_VIRT_START); return 0; } late_initcall(check_virt); yields a positive indication from virt_addr_valid() on all tested addresses: <4>null: 00000000 00040000 1:1 <4>half: 7fffffff 000bffff 1:1 <4>hm-p: ed7ff000 0002d7ff 1:1 <4>hm-1: ed7fffff 0002d7ff 1:1 <4>hm: ed800000 0002d800 1:1 <4>hm+1: ed800001 0002d800 1:1 <4>hm+p: ed801000 0002d801 1:1 <4>km: f56fa000 000356fa 1:1 <4>hv: f5800000 00035800 1:1 Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2008-Oct-16 16:10 UTC
Re: [Xen-devel] [PATCH]: Fix Xen domU boot with batched mprotect
Jan Beulich wrote:>>>> Jeremy Fitzhardinge <jeremy@goop.org> 15.10.08 18:23 >>> >>>> >> virt_addr_valid() is supposed to be usable in this circumstace. The >> comment says "virt_to_page(kaddr) returns a valid pointer if and only if >> virt_addr_valid(kaddr) returns true", which implies that >> virt_addr_valid() returns a meaningful result on all addresses - and if >> not, it should be fixed. >> > > Where did you find this comment? I had no luck grep-ing for it... >It''s in tip.git, which has quite a few changes in this area. http://git.kernel.org/?p=linux/kernel/git/x86/linux-2.6-tip.git;a=blob;f=include/asm-x86/page.h;h=d4f1d5791fc186f29a9a60d4fe182d80f05038e4;hb=HEAD http://git.kernel.org/?p=linux/kernel/git/x86/linux-2.6-tip.git;a=blob;f=arch/x86/mm/ioremap.c;h=ae71e11eb3e5e4ddeceadc9128d3afea564f27e0;hb=HEAD> In any case, if that''s the expectation, then on i386 virt_addr_valid() > must be implemented as something like > > #define virt_addr_valid(kaddr) ((kaddr) >= PAGE_OFFSET && (kaddr) < high_memory && pfn_valid(__pa(kaddr) >> PAGE_SHIFT)) > > x86-64 would need something similar, except that high_memory obviously > must be replaced (or that part could perhaps be left out altogether), and > the un-mapped addresses above the kernel mapping would need to be > filtered out. > > Btw., if you look at other architectures, you''ll see that most of them use > the same (as you say broken) construct. > > Otoh, if that cited statement really holds, then virt_addr_valid() isn''t > really expected to do what its name implies: In particular, there are > valid address ranges in kernel space which it wouldn''t be permitted to > return true on without significantly complicating the virt_to_page() > implementation (e.g. x86-64''s vmalloc and modules areas). >The current x86-64 implementation is: bool __virt_addr_valid(unsigned long x) { if (x >= __START_KERNEL_map) { x -= __START_KERNEL_map; if (x >= KERNEL_IMAGE_SIZE) return false; x += phys_base; } else { if (x < PAGE_OFFSET) return false; x -= PAGE_OFFSET; if (system_state == SYSTEM_BOOTING ? x > MAXMEM : !phys_addr_valid(x)) { return false; } } return pfn_valid(x >> PAGE_SHIFT); } and 32-bit is similar (but simpler, since it doesn''t need to worry about a separate kernel mapping).> > yields a positive indication from virt_addr_valid() on all tested addresses: > > <4>null: 00000000 00040000 1:1 > <4>half: 7fffffff 000bffff 1:1 > <4>hm-p: ed7ff000 0002d7ff 1:1 > <4>hm-1: ed7fffff 0002d7ff 1:1 > <4>hm: ed800000 0002d800 1:1 > <4>hm+1: ed800001 0002d800 1:1 > <4>hm+p: ed801000 0002d801 1:1 > <4>km: f56fa000 000356fa 1:1 > <4>hv: f5800000 00035800 1:1 >It would be interesting to try that with tip.git''s version of __virt_addr_valid(). In the Xen case, all we need is a guarantee that virt_addr_valid() returns true iff __pa(addr) returns a proper result, so that we can use the resulting pfn as an index into pfn->mfn. I believe this is what the current implementation does. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Oct-17 07:12 UTC
Re: [Xen-devel] [PATCH]: Fix Xen domU boot with batched mprotect
>>> Jeremy Fitzhardinge <jeremy@goop.org> 16.10.08 18:10 >>> >The current x86-64 implementation is: > >bool __virt_addr_valid(unsigned long x) >{ > if (x >= __START_KERNEL_map) { > x -= __START_KERNEL_map; > if (x >= KERNEL_IMAGE_SIZE) > return false;This, imo, is still broken (i.e. the name of the function still isn''t matched by the implementation): KERNEL_IMAGE_SIZE is a constant and doesn''t account for the fact that only the real kernel image can be relied upon to be mapped.>and 32-bit is similar (but simpler, since it doesn''t need to worry about a separate kernel mapping).This continues to be broken, but not as badly as it used to be - while it now covers user space and the vmalloc area (I''m unclear why this is excluded only after booting completed, though), hypervisor space continues to not be considered here. But as mentioned before - excluding the vmalloc area seems bogus wrt the name of the function, but as I take it the confusion here is intended. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2008-Oct-17 15:19 UTC
Re: [Xen-devel] [PATCH]: Fix Xen domU boot with batched mprotect
Jan Beulich wrote:>>>> Jeremy Fitzhardinge <jeremy@goop.org> 16.10.08 18:10 >>> >>>> >> The current x86-64 implementation is: >> >> bool __virt_addr_valid(unsigned long x) >> { >> if (x >= __START_KERNEL_map) { >> x -= __START_KERNEL_map; >> if (x >= KERNEL_IMAGE_SIZE) >> return false; >> > > This, imo, is still broken (i.e. the name of the function still isn''t matched > by the implementation): KERNEL_IMAGE_SIZE is a constant and doesn''t > account for the fact that only the real kernel image can be relied upon > to be mapped. >Perhaps, but I don''t think it matters too much. Unless you have a tiny amount of physical memory, locations in the kernel mapping beyond the actual kernel will still resolve to proper locations in the linear map.>> and 32-bit is similar (but simpler, since it doesn''t need to worry about a separate kernel mapping). >> > > This continues to be broken, but not as badly as it used to be - while it > now covers user space and the vmalloc area (I''m unclear why this is > excluded only after booting completed, though), hypervisor space > continues to not be considered here. > > But as mentioned before - excluding the vmalloc area seems bogus wrt > the name of the function, but as I take it the confusion here is intended. >I think a strictly correct name for the function would be can_i_use___pa_on_this_address(vaddr). It isn''t is_this_really_an_addressable_location(vaddr). J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Oct-17 15:30 UTC
Re: [Xen-devel] [PATCH]: Fix Xen domU boot with batched mprotect
>>> Jeremy Fitzhardinge <jeremy@goop.org> 17.10.08 17:19 >>> >Jan Beulich wrote: >>>>> Jeremy Fitzhardinge <jeremy@goop.org> 16.10.08 18:10 >>> >>>>> >>> The current x86-64 implementation is: >>> >>> bool __virt_addr_valid(unsigned long x) >>> { >>> if (x >= __START_KERNEL_map) { >>> x -= __START_KERNEL_map; >>> if (x >= KERNEL_IMAGE_SIZE) >>> return false; >>> >> >> This, imo, is still broken (i.e. the name of the function still isn''t matched >> by the implementation): KERNEL_IMAGE_SIZE is a constant and doesn''t >> account for the fact that only the real kernel image can be relied upon >> to be mapped. >> > >Perhaps, but I don''t think it matters too much. Unless you have a tiny >amount of physical memory, locations in the kernel mapping beyond the >actual kernel will still resolve to proper locations in the linear map.Is e.g. 256Mb tiny? KERNEL_IMAGE_SIZE these days is 512Mb... Indeed, when it was 40Mb (up until a few releases ago), this indeed wouldn''t matter. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2008-Oct-17 15:36 UTC
Re: [Xen-devel] [PATCH]: Fix Xen domU boot with batched mprotect
Jan Beulich wrote:> Is e.g. 256Mb tiny? KERNEL_IMAGE_SIZE these days is 512Mb... Indeed, > when it was 40Mb (up until a few releases ago), this indeed wouldn''t > matter. >Well, the real point is that anyone doing a test on such high kernel addresses is almost certainly buggy anyway. I guess a more precise statement is that it returns well-defined results for any va the calling code could reasonably be using, not for any random bit pattern. But I think we''re getting into the weeds here. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel