Hi Linus, Here are some small Xen bugfixes: * fix dom0 boot on systems whose E820 doesn''t completely cover the ISA address space. This fixes a crash on a Dell R310. * fix misallocation of initial pagetables on 32-bit systems (sizeof(pmd_t) != sizeof(pmd_t *)). In practice this didn''t cause a problem because the following allocation was page-aligned so the padding was big enough to make up the difference. * in xenfs, properly report put_user failures to write back error status The changes are on two branches: git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git upstream/core Ian Campbell (2): xen: do not release any memory under 1M in domain 0 xen: correct size of level2_kernel_pgt arch/x86/xen/mmu.c | 2 +- arch/x86/xen/setup.c | 31 ++++++++++++++++++++++++++----- 2 files changed, 27 insertions(+), 6 deletions(-) git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git upstream/xenfs Vasiliy Kulikov (1): xen: xenfs: privcmd: check put_user() return code drivers/xen/xenfs/privcmd.c | 8 ++------ 1 files changed, 2 insertions(+), 6 deletions(-) Thanks, J diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index e41683c..13cd4eb 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -2104,7 +2104,7 @@ __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd, { pmd_t *kernel_pmd; - level2_kernel_pgt = extend_brk(sizeof(pmd_t *) * PTRS_PER_PMD, PAGE_SIZE); + level2_kernel_pgt = extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE); max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->pt_base) + xen_start_info->nr_pt_frames * PAGE_SIZE + diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c index 8e2c9f21..1f49951 100644 --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -84,6 +84,22 @@ static unsigned long __init xen_release_chunk(phys_addr_t start_addr, start = PFN_UP(start_addr); end = PFN_DOWN(end_addr); + /* + * Domain 0 maintains a 1-1 P2M mapping for the first megabyte + * so do not return such memory to the hypervisor. + * + * This region can contain various firmware tables and the + * like which are often assumed to be always mapped and + * available via phys_to_virt. + */ + if (xen_initial_domain()) { + if (end < PFN_DOWN(ISA_END_ADDRESS)) + return 0; + + if (start < PFN_DOWN(ISA_END_ADDRESS)) + start = PFN_DOWN(ISA_END_ADDRESS); + } + if (end <= start) return 0; @@ -163,6 +179,7 @@ char * __init xen_memory_setup(void) XENMEM_memory_map; rc = HYPERVISOR_memory_op(op, &memmap); if (rc == -ENOSYS) { + BUG_ON(xen_initial_domain()); memmap.nr_entries = 1; map[0].addr = 0ULL; map[0].size = mem_end; @@ -200,12 +217,16 @@ char * __init xen_memory_setup(void) } /* - * Even though this is normal, usable memory under Xen, reserve - * ISA memory anyway because too many things think they can poke - * about in there. + * Even though this is normal, usable memory in a Xen domU, + * reserve ISA memory anyway because too many things think + * they can poke about in there. + * + * In dom0 we use the host e820 and therefore do not need to + * specially reserved anything. */ - e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS, - E820_RESERVED); + if (!xen_initial_domain()) + e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS, + E820_RESERVED); /* * Reserve Xen bits: diff --git a/drivers/xen/xenfs/privcmd.c b/drivers/xen/xenfs/privcmd.c index f80be7f..2eb04c8 100644 --- a/drivers/xen/xenfs/privcmd.c +++ b/drivers/xen/xenfs/privcmd.c @@ -266,9 +266,7 @@ static int mmap_return_errors(void *data, void *state) xen_pfn_t *mfnp = data; struct mmap_batch_state *st = state; - put_user(*mfnp, st->user++); - - return 0; + return put_user(*mfnp, st->user++); } static struct vm_operations_struct privcmd_vm_ops; @@ -323,10 +321,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata) up_write(&mm->mmap_sem); if (state.err > 0) { - ret = 0; - state.user = m.arr; - traverse_pages(m.num, sizeof(xen_pfn_t), + ret = traverse_pages(m.num, sizeof(xen_pfn_t), &pagelist, mmap_return_errors, &state); } _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On Fri, Oct 29, 2010 at 11:57 AM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:> > * fix dom0 boot on systems whose E820 doesn''t completely cover the > ISA address space. This fixes a crash on a Dell R310.Hmm. This clashes with my current tree. And that conflict is trivial to fix up, but the thing is, I think the patch that comes from your tree is worse than what is already there. Why is that simple unconditional e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS, E820_RESERVED); not just always the right thing? Why do you have a separate hack for dom0 in xen_release_chunk() instead? That just looks bogus. The normal logic we use on PC''s is to just always reserve the low 64kB of memory, and the whole ISA space. Why doesn''t Xen just do the same? Linus _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 10/29/2010 12:08 PM, Linus Torvalds wrote:> On Fri, Oct 29, 2010 at 11:57 AM, Jeremy Fitzhardinge <jeremy@goop.org> wrote: >> * fix dom0 boot on systems whose E820 doesn''t completely cover the >> ISA address space. This fixes a crash on a Dell R310. > Hmm. This clashes with my current tree.Bugger, so it does. I just did a test merge with no complaint though; what happened? I''ll redo the patch anyway to fix the below.> And that conflict is trivial to fix up, but the thing is, I think the > patch that comes from your tree is worse than what is already there. > > Why is that simple unconditional > > e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS, > E820_RESERVED); > > not just always the right thing? Why do you have a separate hack for > dom0 in xen_release_chunk() instead? That just looks bogus.Yes, we actually had this discussion. I was for making the e820_add_region unconditional, and Ian''s counter was that it could be done in the common code rather than Xen-specific.> The normal logic we use on PC''s is to just always reserve the low 64kB > of memory, and the whole ISA space. Why doesn''t Xen just do the same?The specific issue is that the Xen domain returns any memory that''s not covered by an E820 entry back to Xen, mostly to make sure that memory isn''t wasted by being shadowed by PCI devices. But it was also doing this in the sub-1M region, which on all the machines I''ve tested on is completely covered. But on a Dell R310 there''s a little 2-page gap where some ACPI stuff is lurking, that was being released back to Xen so it couldn''t be accessed from Linux any more. The fix is to just make sure the whole low region is covered (or at least the 640k-1M space). I''ll rework the patch. Thanks, J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 10/29/2010 12:20 PM, Jeremy Fitzhardinge wrote:> On 10/29/2010 12:08 PM, Linus Torvalds wrote: >> On Fri, Oct 29, 2010 at 11:57 AM, Jeremy Fitzhardinge <jeremy@goop.org> wrote: >>> * fix dom0 boot on systems whose E820 doesn''t completely cover the >>> ISA address space. This fixes a crash on a Dell R310. >> Hmm. This clashes with my current tree. > Bugger, so it does. I just did a test merge with no complaint though; > what happened? > > I''ll redo the patch anyway to fix the below. > >> And that conflict is trivial to fix up, but the thing is, I think the >> patch that comes from your tree is worse than what is already there. >> >> Why is that simple unconditional >> >> e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS, >> E820_RESERVED); >> >> not just always the right thing? Why do you have a separate hack for >> dom0 in xen_release_chunk() instead? That just looks bogus. > Yes, we actually had this discussion. I was for making the > e820_add_region unconditional, and Ian''s counter was that it could be > done in the common code rather than Xen-specific. > >> The normal logic we use on PC''s is to just always reserve the low 64kB >> of memory, and the whole ISA space. Why doesn''t Xen just do the same? > The specific issue is that the Xen domain returns any memory that''s not > covered by an E820 entry back to Xen, mostly to make sure that memory > isn''t wasted by being shadowed by PCI devices. But it was also doing > this in the sub-1M region, which on all the machines I''ve tested on is > completely covered. But on a Dell R310 there''s a little 2-page gap > where some ACPI stuff is lurking, that was being released back to Xen so > it couldn''t be accessed from Linux any more. > > The fix is to just make sure the whole low region is covered (or at > least the 640k-1M space).Hm, I see. This Dell machine stashes the MPS table in 2 pages just *below* 640k, so the ISA_START_ADDRESS-ISA_END_ADDRESS reserved range doesn''t cover it. There''s three ways to fix this: * not free memory below 1M (Ian''s current patch) * fill any E820 gaps below 1M * reserve all memory below 1M The 3rd is certainly simplest, at the cost of wasting a trivial amount of memory. Unfortunately it crashes early. Sigh, will try and sort it out this afternoon. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On Fri, 2010-10-29 at 21:06 +0100, Jeremy Fitzhardinge wrote:> On 10/29/2010 12:20 PM, Jeremy Fitzhardinge wrote: > > On 10/29/2010 12:08 PM, Linus Torvalds wrote: > >> On Fri, Oct 29, 2010 at 11:57 AM, Jeremy Fitzhardinge <jeremy@goop.org> wrote: > >>> * fix dom0 boot on systems whose E820 doesn''t completely cover the > >>> ISA address space. This fixes a crash on a Dell R310. > >> Hmm. This clashes with my current tree. > > Bugger, so it does. I just did a test merge with no complaint though; > > what happened? > > > > I''ll redo the patch anyway to fix the below. > > > >> And that conflict is trivial to fix up, but the thing is, I think the > >> patch that comes from your tree is worse than what is already there. > >> > >> Why is that simple unconditional > >> > >> e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS, > >> E820_RESERVED); > >> > >> not just always the right thing? Why do you have a separate hack for > >> dom0 in xen_release_chunk() instead? That just looks bogus. > > Yes, we actually had this discussion. I was for making the > > e820_add_region unconditional, and Ian''s counter was that it could be > > done in the common code rather than Xen-specific. > > > >> The normal logic we use on PC''s is to just always reserve the low 64kB > >> of memory, and the whole ISA space. Why doesn''t Xen just do the same? > > The specific issue is that the Xen domain returns any memory that''s not > > covered by an E820 entry back to Xen, mostly to make sure that memory > > isn''t wasted by being shadowed by PCI devices. But it was also doing > > this in the sub-1M region, which on all the machines I''ve tested on is > > completely covered. But on a Dell R310 there''s a little 2-page gap > > where some ACPI stuff is lurking, that was being released back to Xen so > > it couldn''t be accessed from Linux any more. > > > > The fix is to just make sure the whole low region is covered (or at > > least the 640k-1M space). > > Hm, I see. This Dell machine stashes the MPS table in 2 pages just > *below* 640k, so the ISA_START_ADDRESS-ISA_END_ADDRESS reserved range > doesn''t cover it.Yes, what the machine has is: (XEN) 0000000000000000 - 000000000009e000 (usable) (XEN) 0000000000100000 - 00000000bf699000 (usable) which after reserving the 640k-1M range shows up in dom0 as: BIOS-provided physical RAM map: Xen: 0000000000000000 - 000000000009e000 (usable) Xen: 00000000000a0000 - 0000000000100000 (reserved) Xen: 0000000000100000 - 0000000020000000 (usable) which has a little 2 page hole between 9e000-a0000 which xen_release_chunk dutifully punches out as a hole in both the virtual and physical address space.> There''s three ways to fix this: > > * not free memory below 1M (Ian''s current patch) > * fill any E820 gaps below 1M > * reserve all memory below 1MThe second two basically indirectly implement the first under Xen. They also seem like things which if they are correct to do in Xen dom0 they would also be correct on native too and therefore belong in sanitize_e820 or somewhere like that. The actual memory which is marked reserved (or just unmentioned) doesn''t really matter much here since the code which goes poking around in this stuff doesn''t check if it is reserved or not (and it shouldn''t since we know the BIOSes can/will get this stuff wrong).> The 3rd is certainly simplest, at the cost of wasting a trivial amount > of memory.Doesn''t Linux avoid using the lowest 1M anyway? (obviously apart from the start of day probing for firmware tables etc).> Unfortunately it crashes early. Sigh, will try and sort it > out this afternoon.Strange! _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 10/31/2010 02:13 AM, Ian Campbell wrote:>> The 3rd is certainly simplest, at the cost of wasting a trivial amount >> of memory. > Doesn''t Linux avoid using the lowest 1M anyway? (obviously apart from > the start of day probing for firmware tables etc).No, it tries to use most of it I think. It will tend to avoid the low 64k (maybe more) to avoid BIOS bugs.>> Unfortunately it crashes early. Sigh, will try and sort it >> out this afternoon. > Strange!I didn''t get a chance to poke at it again, but in retrospect, I think there are various "must succeed" allocations in low memory. We don''t need those allocations (things like AP boot trampoline, etc), but we don''t bother to stub them out or prevent them from happening. Reducing the system to one with *no* allocatable memory below 1M is just too strange, and would be a continuous source of problems in the future. Of the other two options, I think your original approach is going to be simplest. E820 gap filling wouldn''t be too bad, but we''d end up having to add a bit of gap-tracking logic to the E820 loop which isn''t currently there. Ignoring sub-1M gaps is simpler (and it needn''t be conditional on xen_initial_domain(), because we would never expect to see anything strange sub-1M in a domU, and if there is, we should still be careful of it in case something odd is going on). J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On Sun, 2010-10-31 at 16:28 +0000, Jeremy Fitzhardinge wrote:> On 10/31/2010 02:13 AM, Ian Campbell wrote: > >> The 3rd is certainly simplest, at the cost of wasting a trivial amount > >> of memory. > > Doesn''t Linux avoid using the lowest 1M anyway? (obviously apart from > > the start of day probing for firmware tables etc). > > No, it tries to use most of it I think. It will tend to avoid the low > 64k (maybe more) to avoid BIOS bugs.It''ll be interesting to see what effect Vista''s avoidance of the whole region (so I hear) has on BIOS vendors... (I think we can all guess)> >> Unfortunately it crashes early. Sigh, will try and sort it > >> out this afternoon. > > Strange! > > I didn''t get a chance to poke at it again, but in retrospect, I think > there are various "must succeed" allocations in low memory. We don''t > need those allocations (things like AP boot trampoline, etc), but we > don''t bother to stub them out or prevent them from happening. Reducing > the system to one with *no* allocatable memory below 1M is just too > strange, and would be a continuous source of problems in the future.Agreed, we should try and mimic native as far as possible in this regard or I fear we will see a never ending stream of little quirks and oddities related to this sort of thing.> Of the other two options, I think your original approach is going to be > simplest. E820 gap filling wouldn''t be too bad, but we''d end up having > to add a bit of gap-tracking logic to the E820 loop which isn''t > currently there.It would also make us susceptible to perhaps being a bit fragile in the face of unexpectedly insane e820s coming from the BIOS.> Ignoring sub-1M gaps is simpler (and it needn''t be > conditional on xen_initial_domain(), because we would never expect to > see anything strange sub-1M in a domU, and if there is, we should still > be careful of it in case something odd is going on).Absolutely. I wonder if we shouldn''t also do the following (note: untested). Since Xen avoids using the sub-1M region for anything I think it is reasonable to give the whole lot over to domain 0 for the purposes of finding firmware table stashed in odd locations etc. Ian. diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index ebb74ec..ab086e5 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -2321,7 +2321,7 @@ __init void xen_ident_map_ISA(void) xen_raw_printk("Xen: setup ISA identity maps\n"); - for (pa = ISA_START_ADDRESS; pa < ISA_END_ADDRESS; pa += PAGE_SIZE) { + for (pa = 0; pa < ISA_END_ADDRESS; pa += PAGE_SIZE) { pte_t pte = mfn_pte(PFN_DOWN(pa), PAGE_KERNEL_IO); if (HYPERVISOR_update_va_mapping(PAGE_OFFSET + pa, pte, 0)) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel