Hi,

we have some problems with acpidump running on Xen Dom0. On 64 bit Dom0 it will trigger the OOM killer, on 32 bit Dom0s it will cause a kernel crash.
The hypervisor does not matter, I tried 4.1.3-rc2 as well as various unstable versions including 25467, also 32-bit versions of 4.1.
The Dom0 kernels were always PVOPS versions, the problem starts with 3.2-rc1~194 and is still in 3.5.0-rc3.
Also you need to restrict the Dom0 memory with dom0_mem.

The crash says (on a 3.4.3 32bit Dom0 kernel):
uruk:~ # ./acpidump32
[ 158.843444] ------------[ cut here ]------------
[ 158.843460] kernel BUG at mm/rmap.c:1027!
[ 158.843466] invalid opcode: 0000 [#1] SMP
[ 158.843472] Modules linked in:
[ 158.843478]
[ 158.843483] Pid: 4874, comm: acpidump32 Tainted: G W 3.4.0+ #105 empty empty/S3993
[ 158.843493] EIP: 0061:[<c10b0e27>] EFLAGS: 00010246 CPU: 3
[ 158.843505] EIP is at __page_set_anon_rmap+0x12/0x45
[ 158.843511] EAX: d6022dc0 EBX: dfecb6e0 ECX: b76faf64 EDX: b76faf64
[ 158.843516] ESI: 00000000 EDI: b76faf64 EBP: d6091e8c ESP: d6091e84
[ 158.843522] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
[ 158.843529] CR0: 8005003b CR2: b76faf64 CR3: 17633000 CR4: 00000660
[ 158.843535] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 158.843581] DR6: ffff0ff0 DR7: 00000400
[ 158.843586] Process acpidump32 (pid: 4874, ti=d6090000 task=d60b34f0 task.ti=d6090000)
[ 158.843591] Stack:
[ 158.843594] dfecb6e0 00000001 d6091ea8 c10b15c4 00000000 d6022dc0 d61fbdd8 d6022dc0
[ 158.843610] 00000000 d6091efc c10aacbe 00000000 99948025 80000001 d8aa1f80 80000001
[ 158.843631] dfefc800 00000000 d8aa1f80 00000000 166b7025 d7f407d0 b76faf64 99948025
[ 158.843649] Call Trace:
[ 158.843656] [<c10b15c4>] do_page_add_anon_rmap+0x5b/0x64
[ 158.843664] [<c10aacbe>] handle_pte_fault+0x81d/0xa06
[ 158.843674] [<c10ab0ff>] handle_mm_fault+0x1fa/0x209
[ 158.843683] [<c159e4e8>] ? spurious_fault+0x104/0x104
[ 158.843688] [<c159e881>] do_page_fault+0x399/0x3b4
[ 158.843696] [<c10c639d>] ? filp_close+0x55/0x5f
[ 158.843701] [<c10c6408>] ? sys_close+0x61/0xa0
[ 158.843706] [<c159e4e8>] ? spurious_fault+0x104/0x104
[ 158.843714] [<c159c452>] error_code+0x5a/0x60
[ 158.843720] [<c159e4e8>] ? spurious_fault+0x104/0x104
[ 158.843724] Code: e8 45 91 00 00 89 c2 eb 09 2b 50 04 c1 ea 0c 03 50 4c 89 53 08 5b 5e 5d c3 55 89 e5 56 53 89 c3 89 d0 89 ca 8b 70 44 85 f6 75 02 <0f> 0b f6 43 04 01 75 27 83 7d 08 00 75 02 8b 36 46 89 73 04 f6
[ 158.843824] EIP: [<c10b0e27>] __page_set_anon_rmap+0x12/0x45 SS:ESP 0069:d6091e84
[ 158.843848] ---[ end trace 4eaa2a86a8e2da24 ]---
[ 158.843854] note: acpidump32[4874] exited with preempt_count 1


On 64bit the OOM goes around, finally killing the login shell:
uruk:~ # ./acpidump_inst
acpi_map_memory(917504, 131072);
opened /dev/mem (fd=3)
calling mmap(NULL, 131072, PROT_READ, MAP_PRIVATE, fd, e0000);
mmap returned 0xf7571000, function returns 0xf7571000
acpi_map_table(cfef0f64, "XSDT");
acpi_map_memory(3488550756, 36);
opened /dev/mem (fd=3)
calling mmap(NULL, 3976, PROT_READ, MAP_PRIVATE, fd, cfef0000);
mmap returned 0xf76fd000, function returns 0xf76fdf64
having mapped table header
reading signature:

Welcome to SUSE Linux Enterprise Server 11 SP1 (i586) - Kernel 3.5.0-rc3+ (hvc0).

uruk login:
-----------
This dump shows that the bug happens the moment acpidump accesses the mmapped ACPI table at @cfef0000 (the lower map at e0000 works).

This is extra unfortunate as in SLES11 acpidump will be called by the kbd init script (querying the BIOS NumLock setting!)

I bisected the Dom0 kernel to find this one (v3.2-rc~194):
commit 5eef150c1d7e41baaefd00dd56c153debcd86aee
Merge: 315eb8a f3f436e
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue Oct 25 09:17:07 2011 +0200

Merge branch 'stable/e820-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

* 'stable/e820-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen: release all pages within 1-1 p2m mappings
xen: allow extra memory to be in multiple regions
xen: allow balloon driver to use more than one memory region
xen/balloon: simplify test for the end of usable RAM
xen/balloon: account for pages released during memory setup


I tried to find something obvious, but to no avail. At least the new E820 looks sane, nothing that would prevent the mapping of the requested regions. Reverting this commit will not work easily on newer kernels, and is probably not desirable either.

But it does not show on every machine here, so the machine E820 could actually be a differentiator. This particular box was a dual socket Barcelona server with 12GB of memory.

This whole PV memory management goes beyond my knowledge, so I'd like to ask for help on this issue.
If you need more information (I attached the boot log, which shows the two E820 tables), please ask. I can also quickly do some experiments if needed.

Thanks a lot,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
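
The acpidump_inst log in the message above shows a request for 36 bytes at physical address 0xcfef0f64 turning into an mmap of 3976 bytes at file offset cfef0000. A minimal sketch of that arithmetic follows, assuming 4 KiB pages. The names `map_window`, `window_for` and `SKETCH_PAGE_SIZE` are hypothetical and are not taken from the acpidump sources; this only illustrates why the mapping has to start on a page boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on the x86 boxes discussed in this thread. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical helper type: the page-aligned window that has to be mapped
 * so that `len` bytes starting at physical address `phys` are reachable. */
struct map_window {
    uint64_t base;    /* page-aligned file offset to pass to mmap() */
    size_t   offset;  /* position of `phys` inside the first mapped page */
    size_t   length;  /* number of bytes to map: offset + len */
};

/* Computes the window for a request; performs no mapping itself. */
static struct map_window window_for(uint64_t phys, size_t len)
{
    struct map_window w;

    w.offset = (size_t)(phys % SKETCH_PAGE_SIZE);
    w.base   = phys - w.offset;
    w.length = w.offset + len;
    return w;
}
```

For the two requests in the log, `window_for(917504, 131072)` yields base 0xe0000 with offset 0, and `window_for(0xcfef0f64, 36)` yields base 0xcfef0000, offset 0xf64 and length 3976. The caller would add the offset to the pointer returned by mmap, which matches the "mmap returned 0xf76fd000, function returns 0xf76fdf64" line.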
On Wed, Jun 20, 2012 at 02:37:55PM +0200, Andre Przywara wrote:
> Hi,
>
> we have some problems with acpidump running on Xen Dom0. On 64 bit
> Dom0 it will trigger the OOM killer, on 32 bit Dom0s it will cause a
> kernel crash.
> The hypervisor does not matter, I tried 4.1.3-rc2 as well as various
> unstable versions including 25467, also 32-bit versions of 4.1.
> The Dom0 kernels were always PVOPS versions, the problems starts
> with 3.2-rc1~194 and is still in 3.5.0-rc3.
> Also you need to restrict the Dom0 memory with dom0_mem
>
> The crash says (on a 3.4.3 32bit Dom0 kernel):
> uruk:~ # ./acpidump32
> [ 158.843444] ------------[ cut here ]------------
> [ 158.843460] kernel BUG at mm/rmap.c:1027!
> [ 158.843466] invalid opcode: 0000 [#1] SMP
> [ 158.843472] Modules linked in:
> [ 158.843478]
> [ 158.843483] Pid: 4874, comm: acpidump32 Tainted: G W
> 3.4.0+ #105 empty empty/S3993
> [ 158.843493] EIP: 0061:[<c10b0e27>] EFLAGS: 00010246 CPU: 3
> [ 158.843505] EIP is at __page_set_anon_rmap+0x12/0x45
> [ 158.843511] EAX: d6022dc0 EBX: dfecb6e0 ECX: b76faf64 EDX: b76faf64
> [ 158.843516] ESI: 00000000 EDI: b76faf64 EBP: d6091e8c ESP: d6091e84
> [ 158.843522] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
> [ 158.843529] CR0: 8005003b CR2: b76faf64 CR3: 17633000 CR4: 00000660
> [ 158.843535] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [ 158.843581] DR6: ffff0ff0 DR7: 00000400
> [ 158.843586] Process acpidump32 (pid: 4874, ti=d6090000
> task=d60b34f0 task.ti=d6090000)
> [ 158.843591] Stack:
> [ 158.843594] dfecb6e0 00000001 d6091ea8 c10b15c4 00000000
> d6022dc0 d61fbdd8 d6022dc0
> [ 158.843610] 00000000 d6091efc c10aacbe 00000000 99948025
> 80000001 d8aa1f80 80000001
> [ 158.843631] dfefc800 00000000 d8aa1f80 00000000 166b7025
> d7f407d0 b76faf64 99948025
> [ 158.843649] Call Trace:
> [ 158.843656] [<c10b15c4>] do_page_add_anon_rmap+0x5b/0x64
> [ 158.843664] [<c10aacbe>] handle_pte_fault+0x81d/0xa06
> [ 158.843674] [<c10ab0ff>] handle_mm_fault+0x1fa/0x209
> [ 158.843683] [<c159e4e8>] ? spurious_fault+0x104/0x104
> [ 158.843688] [<c159e881>] do_page_fault+0x399/0x3b4
> [ 158.843696] [<c10c639d>] ? filp_close+0x55/0x5f
> [ 158.843701] [<c10c6408>] ? sys_close+0x61/0xa0
> [ 158.843706] [<c159e4e8>] ? spurious_fault+0x104/0x104
> [ 158.843714] [<c159c452>] error_code+0x5a/0x60
> [ 158.843720] [<c159e4e8>] ? spurious_fault+0x104/0x104
> [ 158.843724] Code: e8 45 91 00 00 89 c2 eb 09 2b 50 04 c1 ea 0c 03
> 50 4c 89 53 08 5b 5e 5d c3 55 89 e5 56 53 89 c3 89 d0 89 ca 8b 70 44
> 85 f6 75 02 <0f> 0b f6 43 04 01 75 27 83 7d 08 00 75 02 8b 36 46 89
> 73 04 f6
> [ 158.843824] EIP: [<c10b0e27>] __page_set_anon_rmap+0x12/0x45
> SS:ESP 0069:d6091e84
> [ 158.843848] ---[ end trace 4eaa2a86a8e2da24 ]---
> [ 158.843854] note: acpidump32[4874] exited with preempt_count 1
>
>
> On 64bit the OOM goes around, finally killing the login shell:
> uruk:~ # ./acpidump_inst
> acpi_map_memory(917504, 131072);
> opened /dev/mem (fd=3)
> calling mmap(NULL, 131072, PROT_READ, MAP_PRIVATE, fd, e0000);
> mmap returned 0xf7571000, function returns 0xf7571000
> acpi_map_table(cfef0f64, "XSDT");
> acpi_map_memory(3488550756, 36);
> opened /dev/mem (fd=3)
> calling mmap(NULL, 3976, PROT_READ, MAP_PRIVATE, fd, cfef0000);
> mmap returned 0xf76fd000, function returns 0xf76fdf64
> having mapped table header
> reading signature:
>
> Welcome to SUSE Linux Enterprise Server 11 SP1 (i586) - Kernel
> 3.5.0-rc3+ (hvc0).
>
> uruk login:
> -----------
> This dump shows that the bug happens the moment acpidump accesses
> the mmapped ACPI table at @cfef0000 (the lower map at e0000 works).

What is the e0000 one? I don't see in your E820 the region being reserved?

>
> This is extra unfortunate as in SLES11 acpidump will be called by
> the kbd init script (querying the BIOS NumLock setting!)

Ah. Is the acpidump somewhere easily available to compile? Should I get it from here:
http://www.lesswatts.org/projects/acpi/utilities.php

>
> I bisected the Dom0 kernel to find this one (v3.2-rc~194):
> commit 5eef150c1d7e41baaefd00dd56c153debcd86aee
> Merge: 315eb8a f3f436e
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Tue Oct 25 09:17:07 2011 +0200
>
> Merge branch 'stable/e820-3.2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
>
> * 'stable/e820-3.2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:

Oh boy. v3.2 .. that is eons ago! :-)

> xen: release all pages within 1-1 p2m mappings
> xen: allow extra memory to be in multiple regions
> xen: allow balloon driver to use more than one memory region
> xen/balloon: simplify test for the end of usable RAM
> xen/balloon: account for pages released during memory setup
>
>
> I tried to find something obvious, but to no avail. At least the new
> E820 looks sane, nothing that would prevent the mapping of the
> requested regions. Reverting this commit will not work easily on
> newer kernels, also is probably not desirable.

The one thing that comes to my mind is the 1-1 mapping having some issues. Can you boot the kernel with 'debug loglevel=8'? That should print something like this:

Setting pfn cfef0->cfef7 to 1-1

or such during bootup.

>
> But it does not show on every machine here, so the machine E820
> could actually be a differentiator. This particular box was a dual
> socket Barcelona server with 12GB of memory.
>
> This whole PV memory management goes beyond my knowledge, so I'd
> like to ask for help on this issue.
> If you need more information (I attached the boot log, which shows
> the two E820 tables), please ask. I can also quickly do some
> experiments if needed.

This is a strange one - the P2M code should fetch the MFN (so it should give you cfef0) whenever anybody asks for that. Let's double-check that.

Can you try this little module?
[not compile tested]

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/pagemap.h>
#include <linux/init.h>
#include <xen/xen.h>
#define ACPITEST "0.1"

MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
MODULE_DESCRIPTION("acpitest");
MODULE_LICENSE("GPL");
MODULE_VERSION(ACPITEST);

static int __init acpitest_init(void)
{
	unsigned int pfn = 0xcfef0;
	unsigned int mfn;
	void *data;

	mfn = pfn_to_mfn(pfn);
	WARN_ON(pfn != mfn, "We get %lx instead of %lx!\n", pfn, mfn);
	if (pfn != mfn) {
		printk(KERN_INFO "raw p2m (%lx) gives us: %lx\n", pfn, get_phys_to_machine(pfn));
		return -EINVAL;
	}
	data = mfn_to_virt(mfn);
	printk(KERN_INFO "va is 0x%lx\n", data);
	print_hex_dump_bytes("acpi:", DUMP_PREFIX_OFFSET, data, PAGE_SIZE);

	return 0;
}
static void __exit acpitest_exit(void)
{
}
module_init(acpitest_init);
module_exit(acpitest_exit);
On 06/20/2012 04:51 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 20, 2012 at 02:37:55PM +0200, Andre Przywara wrote:

Konrad,

thanks for looking at the problem. Replies inline...

>> we have some problems with acpidump running on Xen Dom0. On 64 bit
>> Dom0 it will trigger the OOM killer, on 32 bit Dom0s it will cause a
>> kernel crash.
>> The hypervisor does not matter, I tried 4.1.3-rc2 as well as various
>> unstable versions including 25467, also 32-bit versions of 4.1.
>> The Dom0 kernels were always PVOPS versions, the problems starts
>> with 3.2-rc1~194 and is still in 3.5.0-rc3.
>> Also you need to restrict the Dom0 memory with dom0_mem
>>
>> The crash says (on a 3.4.3 32bit Dom0 kernel):
>> uruk:~ # ./acpidump32
>> [ 158.843444] ------------[ cut here ]------------
>> [ 158.843460] kernel BUG at mm/rmap.c:1027!
>> [ 158.843466] invalid opcode: 0000 [#1] SMP
>> [ 158.843472] Modules linked in:
>> [ 158.843478]
>> [ 158.843483] Pid: 4874, comm: acpidump32 Tainted: G W
>> 3.4.0+ #105 empty empty/S3993
>> [ 158.843493] EIP: 0061:[<c10b0e27>] EFLAGS: 00010246 CPU: 3
>> [ 158.843505] EIP is at __page_set_anon_rmap+0x12/0x45
>> [ 158.843511] EAX: d6022dc0 EBX: dfecb6e0 ECX: b76faf64 EDX: b76faf64
>> [ 158.843516] ESI: 00000000 EDI: b76faf64 EBP: d6091e8c ESP: d6091e84
>> [ 158.843522] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
>> [ 158.843529] CR0: 8005003b CR2: b76faf64 CR3: 17633000 CR4: 00000660
>> [ 158.843535] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
>> [ 158.843581] DR6: ffff0ff0 DR7: 00000400
>> [ 158.843586] Process acpidump32 (pid: 4874, ti=d6090000
>> task=d60b34f0 task.ti=d6090000)
>> [ 158.843591] Stack:
>> [ 158.843594] dfecb6e0 00000001 d6091ea8 c10b15c4 00000000
>> d6022dc0 d61fbdd8 d6022dc0
>> [ 158.843610] 00000000 d6091efc c10aacbe 00000000 99948025
>> 80000001 d8aa1f80 80000001
>> [ 158.843631] dfefc800 00000000 d8aa1f80 00000000 166b7025
>> d7f407d0 b76faf64 99948025
>> [ 158.843649] Call Trace:
>> [ 158.843656] [<c10b15c4>] do_page_add_anon_rmap+0x5b/0x64
>> [ 158.843664] [<c10aacbe>] handle_pte_fault+0x81d/0xa06
>> [ 158.843674] [<c10ab0ff>] handle_mm_fault+0x1fa/0x209
>> [ 158.843683] [<c159e4e8>] ? spurious_fault+0x104/0x104
>> [ 158.843688] [<c159e881>] do_page_fault+0x399/0x3b4
>> [ 158.843696] [<c10c639d>] ? filp_close+0x55/0x5f
>> [ 158.843701] [<c10c6408>] ? sys_close+0x61/0xa0
>> [ 158.843706] [<c159e4e8>] ? spurious_fault+0x104/0x104
>> [ 158.843714] [<c159c452>] error_code+0x5a/0x60
>> [ 158.843720] [<c159e4e8>] ? spurious_fault+0x104/0x104
>> [ 158.843724] Code: e8 45 91 00 00 89 c2 eb 09 2b 50 04 c1 ea 0c 03
>> 50 4c 89 53 08 5b 5e 5d c3 55 89 e5 56 53 89 c3 89 d0 89 ca 8b 70 44
>> 85 f6 75 02 <0f> 0b f6 43 04 01 75 27 83 7d 08 00 75 02 8b 36 46 89
>> 73 04 f6
>> [ 158.843824] EIP: [<c10b0e27>] __page_set_anon_rmap+0x12/0x45
>> SS:ESP 0069:d6091e84
>> [ 158.843848] ---[ end trace 4eaa2a86a8e2da24 ]---
>> [ 158.843854] note: acpidump32[4874] exited with preempt_count 1
>>
>>
>> On 64bit the OOM goes around, finally killing the login shell:
>> uruk:~ # ./acpidump_inst
>> acpi_map_memory(917504, 131072);
>> opened /dev/mem (fd=3)
>> calling mmap(NULL, 131072, PROT_READ, MAP_PRIVATE, fd, e0000);
>> mmap returned 0xf7571000, function returns 0xf7571000
>> acpi_map_table(cfef0f64, "XSDT");
>> acpi_map_memory(3488550756, 36);
>> opened /dev/mem (fd=3)
>> calling mmap(NULL, 3976, PROT_READ, MAP_PRIVATE, fd, cfef0000);
>> mmap returned 0xf76fd000, function returns 0xf76fdf64
>> having mapped table header
>> reading signature:
>>
>> Welcome to SUSE Linux Enterprise Server 11 SP1 (i586) - Kernel
>> 3.5.0-rc3+ (hvc0).
>>
>> uruk login:
>> -----------
>> This dump shows that the bug happens the moment acpidump accesses
>> the mmapped ACPI table at @cfef0000 (the lower map at e0000 works).
>
> What is the e0000 one? I don't see in your E820 the region being
> reserved?

E0000 is the below 1 MB BIOS area with the ACPI RSDP root pointer. E000:0000 in old DOS speak.
The ACPI spec says that the pointer to the tables is hidden somewhere between 896K and 1MB at 16 byte granularity. acpidump scans this area for the ACPI magic number.
So mapping /dev/mem is not fully broken, as this part at least works.

>>
>> This is extra unfortunate as in SLES11 acpidump will be called by
>> the kbd init script (querying the BIOS NumLock setting!)
>
> Ah. Is the acpidump somewhere easily available to compile? Should
> I get it from here:
> http://www.lesswatts.org/projects/acpi/utilities.php

Right, it is in the pmtools-20071116.tar.gz archive. Just say make in the acpidump directory.

>>
>> I bisected the Dom0 kernel to find this one (v3.2-rc~194):
>> commit 5eef150c1d7e41baaefd00dd56c153debcd86aee
>> Merge: 315eb8a f3f436e
>> Author: Linus Torvalds <torvalds@linux-foundation.org>
>> Date: Tue Oct 25 09:17:07 2011 +0200
>>
>> Merge branch 'stable/e820-3.2' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
>>
>> * 'stable/e820-3.2' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
>
> Oh boy. v3.2 .. that is eons ago! :-)

Tell that to those 2.6.32 or even 2.6.18 users...

>
>> xen: release all pages within 1-1 p2m mappings
>> xen: allow extra memory to be in multiple regions
>> xen: allow balloon driver to use more than one memory region
>> xen/balloon: simplify test for the end of usable RAM
>> xen/balloon: account for pages released during memory setup
>>
>>
>> I tried to find something obvious, but to no avail. At least the new
>> E820 looks sane, nothing that would prevent the mapping of the
>> requested regions. Reverting this commit will not work easily on
>> newer kernels, also is probably not desirable.
>
> The one thing that comes to my mind is the 1-1 mapping having
> some issues. Can you boot the kernel with 'debug loglevel=8'. That should
> print something like this:
>
> Setting pfn cfef0->cfef7 to 1-1
> or such during bootup.

Hmm, I couldn't trigger such messages. Do I need some magic config to enable them?
So far I have (among others):
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VIRTUAL=y
CONFIG_DEBUG_MEMORY_INIT=y

>>
>> But it does not show on every machine here, so the machine E820
>> could actually be a differentiator. This particular box was a dual
>> socket Barcelona server with 12GB of memory.
>>
>> This whole PV memory management goes beyond my knowledge, so I'd
>> like to ask for help on this issue.
>> If you need more information (I attached the boot log, which shows
>> the two E820 tables), please ask. I can also quickly do some
>> experiments if needed.
>
> This is strange one - the P2M code should fetch the MFN (so it should
> give you cfef0) whenever anybody asks for that. Lets double-check that.
>
> Can you try this little module?

Right, it chokes. Mapping memory below 1MB works:
# insmod testxenmap.ko pfn=0xf8
# rmmod testxenmap
# dmesg
...
[ 60.369526] va is 0xffff8800000f8000
[ 60.369533] acpi:00000000: 80 dc 0f 00 00 ff 00 00 00 00 00 00 00 00 00 00  ................
[ 60.369536] acpi:00000010: 52 53 44 20 50 54 52 20 4a 50 54 4c 54 44 20 02  RSD PTR JPTLTD .
[ 60.369538] acpi:00000020: 20 0f ef cf 24 00 00 00 64 0f ef cf 00 00 00 00   ...$...d.......
....
you see the magic "RSD PTR " string here, at 0x20 the 32bit address of the actual tables (0xcfef0f20), which we try next:
# insmod testxenmap.ko pfn=0xcfef0
insmod: error inserting 'testxenmap.ko': -1 Invalid parameters
# dmesg
....
[ 351.964914] ------------[ cut here ]------------
[ 351.964924] WARNING: at /src/linux-2.6/xentest/testxenmap.c:24 acpitest_init+0x5e/0x1000 [testxenmap]()
[ 351.964926] Hardware name: empty
[ 351.964928] We get cfef0 instead of ffffffffffffffff!
[ 351.964933] Modules linked in: testxenmap(O+) [last unloaded: testxenmap]
[ 351.964936] Pid: 4937, comm: insmod Tainted: G W O 3.5.0-rc3+ #106
[ 351.964938] Call Trace:
[ 351.964944] [<ffffffffa000a05e>] ? acpitest_init+0x5e/0x1000 [testxenmap]
[ 351.964953] [<ffffffff81050747>] warn_slowpath_common+0x80/0x98
[ 351.964956] [<ffffffffa000a000>] ? 0xffffffffa0009fff
[ 351.964959] [<ffffffff810507f3>] warn_slowpath_fmt+0x41/0x43
[ 351.964963] [<ffffffffa000a05e>] acpitest_init+0x5e/0x1000 [testxenmap]
[ 351.964966] [<ffffffffa000a000>] ? 0xffffffffa0009fff
[ 351.964971] [<ffffffff8100215a>] do_one_initcall+0x7a/0x134
[ 351.964976] [<ffffffff81094512>] sys_init_module+0xbf/0x24b
[ 351.964982] [<ffffffff816bb826>] cstar_dispatch+0x7/0x21
[ 351.964985] ---[ end trace 4eaa2a86a8e2da24 ]---
[ 351.964987] raw p2m (cfef0) gives us: ffffffffffffffff

starting the kernel without dom0_mem (where acpidump works flawlessly) also makes the module crash, although only at the point dumping the buffer (so this could be a different issue):
# insmod testxenmap.ko pfn=0xcfef0
[ 243.071693] va is 0xffff8800cfef0000
[ 243.071710] BUG: unable to handle kernel paging request at ffff8800cfef0000
[ 243.071733] IP: [<ffffffff81275a22>] hex_dump_to_buffer+0x19c/0x282
[ 243.071742] PGD 1c0c067 PUD f5b067 PMD fdb067 PTE 0
[ 243.071748] Oops: 0000 [#1] SMP
[ 243.071753] CPU 5
[ 243.071757] Modules linked in: testxenmap(O+) [last unloaded: testxenmap]
[ 243.071762]
[ 243.071768] Pid: 4825, comm: insmod Tainted: G W O 3.5.0-rc3+ #106 empty empty/S3993
[ 243.071777] RIP: e030:[<ffffffff81275a22>] [<ffffffff81275a22>] hex_dump_to_buffer+0x19c/0x282
[ 243.071783] RSP: e02b:ffff880312e2fd58 EFLAGS: 00010203
...

Hope that helps and thanks!
Andre.

> [not compile tested]

ACK ;-)

>
> #include <linux/module.h>
> #include <linux/kthread.h>
> #include <linux/pagemap.h>
> #include <linux/init.h>
> #include <xen/xen.h>
+ #include <xen/page.h>
> #define ACPITEST "0.1"
>
> MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
> MODULE_DESCRIPTION("acpitest");
> MODULE_LICENSE("GPL");
> MODULE_VERSION(ACPITEST);
>
+unsigned long pfn = 0xcfef0;
+module_param(pfn, ulong, 0644);
+MODULE_PARM_DESC(pfn, "pfn to test");
+
> static int __init acpitest_init(void)
> {
- unsigned int pfn = 0xcfef0;
- unsigned int mfn;
+ unsigned long mfn;
> void *data;
>
> mfn = pfn_to_mfn(pfn);
- WARN_ON(pfn != mfn, "We get %lx instead of %lx!\n", pfn, mfn);
+ WARN(pfn != mfn, "We get %lx instead of %lx!\n", pfn, mfn);
> if (pfn != mfn) {
> printk(KERN_INFO "raw p2m (%lx) gives us: %lx\n", pfn, get_phys_to_machine(pfn));
> return -EINVAL;
> }
> data = mfn_to_virt(mfn);
- printk(KERN_INFO "va is 0x%lx\n", data);
+ printk(KERN_INFO "va is 0x%p\n", data);
> print_hex_dump_bytes("acpi:", DUMP_PREFIX_OFFSET, data, PAGE_SIZE);
>
> return 0;
> }
> static void __exit acpitest_exit(void)
> {
> }
> module_init(acpitest_init);
> module_exit(acpitest_exit);
>
>

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
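
Andre's description of how acpidump finds the root pointer (scan the area below 1 MB at 16 byte granularity for the magic string) can be sketched roughly as below. This is an illustration and not the pmtools code: `find_rsdp` and `rsdp_rsdt_address` are hypothetical names. The layout it relies on (8-byte "RSD PTR " signature, 20-byte checksummed header, 32-bit RSDT address at byte 16) is the ACPI 1.0 part of the RSDP structure.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the offset of the first plausible RSDP in buf, or -1 if none.
 * Candidates are only checked on 16-byte boundaries. A candidate must
 * start with "RSD PTR " and its first 20 bytes must sum to 0 modulo 256. */
static long find_rsdp(const uint8_t *buf, size_t len)
{
    for (size_t off = 0; off + 20 <= len; off += 16) {
        uint8_t sum = 0;

        if (memcmp(buf + off, "RSD PTR ", 8) != 0)
            continue;
        for (size_t i = 0; i < 20; i++)
            sum = (uint8_t)(sum + buf[off + i]);
        if (sum == 0)
            return (long)off;
    }
    return -1;
}

/* Reads the little-endian 32-bit RSDT address stored at byte 16 of the RSDP. */
static uint32_t rsdp_rsdt_address(const uint8_t *rsdp)
{
    return (uint32_t)rsdp[16] |
           (uint32_t)rsdp[17] << 8 |
           (uint32_t)rsdp[18] << 16 |
           (uint32_t)rsdp[19] << 24;
}
```

Fed the 48 bytes of the hex dump for pfn 0xf8 quoted above, `find_rsdp` returns 0x10, and `rsdp_rsdt_address` on that position gives 0xcfef0f20, the address Andre reads off the dump.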
> >>I tried to find something obvious, but to no avail. At least the new
> >>E820 looks sane, nothing that would prevent the mapping of the
> >>requested regions. Reverting this commit will not work easily on
> >>newer kernels, also is probably not desirable.
> >
> >The one thing that comes to my mind is the 1-1 mapping having
> >some issues. Can you boot the kernel with 'debug loglevel=8'. That should
> >print something like this:
> >
> >Setting pfn cfef0->cfef7 to 1-1
> >or such during bootup.
>
> Hmm, I couldn't trigger such messages. Do I need some magic config
> to enable them? So far I have (among others):
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_VM=y
> CONFIG_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_MEMORY_INIT=y

They should show up as part of the bootup process:

# dmesg | head
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 3.5.0-rc4upstream-00211-g9acc7bd (konrad@build.dumpdata.com) (gcc version 4.4.4 20100503 (Red Hat 4.4.4-2) (GCC) ) #1 SMP Thu Jun 28 18:09:41 EDT 2012
[ 0.000000] Command line: debug memblock=debug console=tty console=hvc0 earlyprintk=xen loglevel=10 initcall_debug xen-pciback.hide=(04:00.0)
[ 0.000000] Disabled fast string operations
[ 0.000000] Freeing 9a-100 pfn range: 102 pages freed
[ 0.000000] 1-1 mapping on 9a->100
[ 0.000000] Freeing 20000-20200 pfn range: 512 pages freed
[ 0.000000] 1-1 mapping on 20000->20200
[ 0.000000] 1-1 mapping on 40000->40200

>
> >>
> >>But it does not show on every machine here, so the machine E820
> >>could actually be a differentiator. This particular box was a dual
> >>socket Barcelona server with 12GB of memory.
> >>
> >>This whole PV memory management goes beyond my knowledge, so I'd
> >>like to ask for help on this issue.
> >>If you need more information (I attached the boot log, which shows
> >>the two E820 tables), please ask. I can also quickly do some
> >>experiments if needed.
> >
> >This is strange one - the P2M code should fetch the MFN (so it should
> >give you cfef0) whenever anybody asks for that. Lets double-check that.
> >
> >Can you try this little module?
>
> Right, it chokes. Mapping memory below 1MB works:
> # insmod testxenmap.ko pfn=0xf8
> # rmmod testxenmap
> # dmesg
> ...
> [ 60.369526] va is 0xffff8800000f8000
> [ 60.369533] acpi:00000000: 80 dc 0f 00 00 ff 00 00 00 00 00 00 00
> 00 00 00 ................
> [ 60.369536] acpi:00000010: 52 53 44 20 50 54 52 20 4a 50 54 4c 54
> 44 20 02 RSD PTR JPTLTD .
> [ 60.369538] acpi:00000020: 20 0f ef cf 24 00 00 00 64 0f ef cf 00
> 00 00 00 ...$...d.......
> ....
> you see the magic "RSD PTR " string here, at 0x20 the 32bit address
> of the actual tables (0xcfef0f20), which we try next:
> # insmod testxenmap.ko pfn=0xcfef0
> insmod: error inserting 'testxenmap.ko': -1 Invalid parameters
> # dmesg
> ....
> [ 351.964914] ------------[ cut here ]------------
> [ 351.964924] WARNING: at /src/linux-2.6/xentest/testxenmap.c:24
> acpitest_init+0x5e/0x1000 [testxenmap]()
> [ 351.964926] Hardware name: empty
> [ 351.964928] We get cfef0 instead of ffffffffffffffff!

Is cfef0 part of the 1-1 mapping and in ACPI? On my box I see this:

# dmesg | head -30 | grep bc55
[ 0.000000] 1-1 mapping on bc558->bc5ac
[ 0.000000] Xen: [mem 0x0000000040200000-0x00000000bc557fff] usable
[ 0.000000] Xen: [mem 0x00000000bc558000-0x00000000bc560fff] ACPI data

So the E820 has it marked as ACPI data and sure enough I also see this:

[ 0.000000] ACPI: DSDT 00000000bc558168 079E1 (v02 INTEL DQ67SW 00000016 INTL 20051117)

Let me see what I get with the little module.

> [ 351.964933] Modules linked in: testxenmap(O+) [last unloaded: testxenmap]
> [ 351.964936] Pid: 4937, comm: insmod Tainted: G W O
> 3.5.0-rc3+ #106
> [ 351.964938] Call Trace:
> [ 351.964944] [<ffffffffa000a05e>] ? acpitest_init+0x5e/0x1000
> [testxenmap]
> [ 351.964953] [<ffffffff81050747>] warn_slowpath_common+0x80/0x98
> [ 351.964956] [<ffffffffa000a000>] ? 0xffffffffa0009fff
> [ 351.964959] [<ffffffff810507f3>] warn_slowpath_fmt+0x41/0x43
> [ 351.964963] [<ffffffffa000a05e>] acpitest_init+0x5e/0x1000 [testxenmap]
> [ 351.964966] [<ffffffffa000a000>] ? 0xffffffffa0009fff
> [ 351.964971] [<ffffffff8100215a>] do_one_initcall+0x7a/0x134
> [ 351.964976] [<ffffffff81094512>] sys_init_module+0xbf/0x24b
> [ 351.964982] [<ffffffff816bb826>] cstar_dispatch+0x7/0x21
> [ 351.964985] ---[ end trace 4eaa2a86a8e2da24 ]---
> [ 351.964987] raw p2m (cfef0) gives us: ffffffffffffffff
>
> starting the kernel without dom0_mem (where acpidump works
> flawlessly) also makes the module crash, although only at the point
> dumping the buffer (so this could be a different issue):

Yeah, that is b/c the pfn_to_mfn is trying to use a tree that would not be initialized.
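
Konrad's question ("Is cfef0 part of the 1-1 mapping and in ACPI?") boils down to looking an address up in the E820 map. A small sketch of such a lookup follows, using the two entries from his dmesg output as sample data. The types and the function name `region_of` are made up for this illustration and do not correspond to kernel structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region types, limited to what the quoted E820 lines show. */
enum region_type { REGION_NONE, REGION_USABLE, REGION_ACPI_DATA };

struct e820_entry {
    uint64_t start;          /* first byte of the region */
    uint64_t end;            /* last byte of the region, inclusive as printed */
    enum region_type type;
};

/* Returns the type of the entry containing addr, or REGION_NONE. */
static enum region_type region_of(const struct e820_entry *map, size_t n,
                                  uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= map[i].start && addr <= map[i].end)
            return map[i].type;
    return REGION_NONE;
}
```

With the entries `{0x40200000, 0xbc557fff, usable}` and `{0xbc558000, 0xbc560fff, ACPI data}`, the DSDT address 0xbc558168 and the first byte of pfn 0xbc558 both land in the ACPI data entry, which is the check Konrad does by eye.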
> > [ 351.964914] ------------[ cut here ]------------
> > [ 351.964924] WARNING: at /src/linux-2.6/xentest/testxenmap.c:24
> > acpitest_init+0x5e/0x1000 [testxenmap]()
> > [ 351.964926] Hardware name: empty
> > [ 351.964928] We get cfef0 instead of ffffffffffffffff!
>
> Is cfef0 part of the 1-1 mapping and in ACPI? On my box I see this:
>
> # dmesg | head -30 | grep bc55
> [ 0.000000] 1-1 mapping on bc558->bc5ac
> [ 0.000000] Xen: [mem 0x0000000040200000-0x00000000bc557fff] usable
> [ 0.000000] Xen: [mem 0x00000000bc558000-0x00000000bc560fff] ACPI data
>
> So the E820 has it marked a ACPI data and sure enough I also see this:
>
> [ 0.000000] ACPI: DSDT 00000000bc558168 079E1 (v02 INTEL DQ67SW 00000016 INTL 20051117)
>
> Let me see what I get with the little module.

So:
[ 0.000000] 1-1 mapping on 9a->100
[ 0.000000] 1-1 mapping on 20000->20200
[ 0.000000] 1-1 mapping on 40000->40200
[ 0.000000] 1-1 mapping on bc558->bc5ac
[ 0.000000] 1-1 mapping on bc5b4->bc8c5
[ 0.000000] 1-1 mapping on bc8c6->bcb7c
[ 0.000000] 1-1 mapping on bcd00->100000

> dmesg | grep ACPI: | head
[ 0.000000] ACPI: RSDP 00000000000f0450 00024 (v02 INTEL)
[ 0.000000] ACPI: XSDT 00000000bc558070 00064 (v01 INTEL DQ67SW 01072009 AMI 00010013)
[ 0.000000] ACPI: FACP 00000000bc55fb50 000F4 (v04 INTEL DQ67SW 01072009 AMI 00010013)
[ 0.000000] ACPI: DSDT 00000000bc558168 079E1 (v02 INTEL DQ67SW 00000016 INTL 20051117)
[ 0.000000] ACPI: FACS 00000000bc8dbf80 00040
[ 0.000000] ACPI: APIC 00000000bc55fc48 00072 (v03 INTEL DQ67SW 01072009 AMI 00010013)
[ 0.000000] ACPI: TCPA 00000000bc55fcc0 00032 (v02 INTEL DQ67SW 00000001 MSFT 01000013)
[ 0.000000] ACPI: SSDT 00000000bc55fcf8 00102 (v01 INTEL DQ67SW 00000001 MSFT 03000001)
[ 0.000000] ACPI: MCFG 00000000bc55fe00 0003C (v01 INTEL DQ67SW 01072009 MSFT 00000097)
[ 0.000000] ACPI: HPET 00000000bc55fe40 00038 (v01 INTEL DQ67SW 01072009 AMI. 00000004)

02:11:06 # 42 :~/
> rmmod acpidump;insmod /acpidump.ko pfn=0xbc55e

02:11:15 # 43 :~/
> rmmod acpidump;insmod /acpidump.ko pfn=0xbc559

02:11:26 # 44 :~/
> rmmod acpidump;insmod /acpidump.ko pfn=0xbc558
insmod: error inserting '/acpidump.ko': -1 Invalid parameters

2:16:37 # 8 :/data/
> insmod /acpidump.ko pfn=0xbc5ac
insmod: error inserting '/acpidump.ko': -1 Invalid parameters

02:16:45 # 10 :/data/
> dmesg | grep p2m
[ 389.847683] raw p2m (bc558) gives us: ffffffffffffffff
[ 701.348502] raw p2m (bc5ac) gives us: ffffffffffffffff

Huh? Looks like I can access the ACPI regions (bc559 had a bunch of stuff), but _not_ on the boundary PFNs.

Plot thickens - but sadly I won't be able to do much until Thursday.

I think the issue is somewhere in set_phys_range_identity. This loop:

	for (pfn = pfn_s; pfn < pfn_e; pfn++)
		if (!__set_phys_to_machine(pfn, IDENTITY_FRAME(pfn)))
			break;

Probably needs pfn <= pfn_e. But that still does not explain why pfn_s is failing.

Or maybe in the pfn_to_mfn machinery. It certainly has a lot of overrides in it. If you were to instrument any of those to print out more details on the offending PFNs that could help.
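
To make the half-open loop Konrad quotes easier to reason about, here is a userspace toy model of it. Everything in it (`toy_p2m`, `toy_set_identity`, `toy_lookup`, `TOY_INVALID`) is hypothetical, and it does not model the kernel's real p2m tree or the encoding done by IDENTITY_FRAME(). It only shows which PFNs a `pfn < pfn_e` loop touches.

```c
#include <stddef.h>
#include <stdint.h>

/* Stands in for the all-ones value printed as "raw p2m" in the logs. */
#define TOY_INVALID UINT64_MAX

/* Toy p2m: table[i] holds the frame recorded for PFN base + i. */
struct toy_p2m {
    uint64_t  base;
    size_t    count;
    uint64_t *table;
};

/* Marks every slot as invalid. */
static void toy_init(struct toy_p2m *p)
{
    for (size_t i = 0; i < p->count; i++)
        p->table[i] = TOY_INVALID;
}

/* Marks the half-open range [pfn_s, pfn_e) as identity mapped.
 * pfn_e itself is not touched, mirroring the quoted loop condition. */
static void toy_set_identity(struct toy_p2m *p, uint64_t pfn_s, uint64_t pfn_e)
{
    for (uint64_t pfn = pfn_s; pfn < pfn_e; pfn++)
        if (pfn >= p->base && pfn - p->base < p->count)
            p->table[pfn - p->base] = pfn;
}

/* Returns the recorded frame, or TOY_INVALID for unset or out-of-range PFNs. */
static uint64_t toy_lookup(const struct toy_p2m *p, uint64_t pfn)
{
    if (pfn < p->base || pfn - p->base >= p->count)
        return TOY_INVALID;
    return p->table[pfn - p->base];
}
```

In this model, setting identity on bc558->bc5ac leaves bc5ac invalid, so a failing lookup there would be expected if the printed end of the range is exclusive (an assumption). The first PFN, bc558, comes out as identity, so a loop of this shape alone does not account for the failure at pfn_s.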
On 20/06/12 13:37, Andre Przywara wrote:
> Hi,
>
> we have some problems with acpidump running on Xen Dom0. On 64 bit Dom0
> it will trigger the OOM killer, on 32 bit Dom0s it will cause a kernel
> crash.
> The hypervisor does not matter, I tried 4.1.3-rc2 as well as various
> unstable versions including 25467, also 32-bit versions of 4.1.
> The Dom0 kernels were always PVOPS versions, the problems starts with
> 3.2-rc1~194 and is still in 3.5.0-rc3.
> Also you need to restrict the Dom0 memory with dom0_mem

This is odd. Can you try a range of dom0_mem settings to see which values cause this?

David
On 06/30/2012 04:19 AM, Konrad Rzeszutek Wilk wrote:

Konrad, David,

back on track for this issue. Thanks for your input, I could do some more debugging (see below for a refresh):

It seems like it affects only the first page of the 1:1 mapping. I didn't have any issues with the last PFN or the page behind it (which failed properly).

David, thanks for the hint with varying the dom0_mem parameter. I thought I had already checked this, but I did it once again and it turned out that it is only an issue if dom0_mem is smaller than the ACPI area, which generates a hole in the memory map. So we have (simplified):
* 1:1 mapping to 1 MB
* normal mapping till dom0_mem
* unmapped area till ACPI E820 area
* ACPI E820 1:1 mapping

As far as I could chase it down the 1:1 mapping itself looks OK, I couldn't find any off-by-one bugs here. So maybe it is code that later on invalidates areas between the normal guest mapping and the ACPI mem?

Hope that helps, I will also try to find more about this.

Thanks,
Andre.

>>> [ 351.964914] ------------[ cut here ]------------
>>> [ 351.964924] WARNING: at /src/linux-2.6/xentest/testxenmap.c:24
>>> acpitest_init+0x5e/0x1000 [testxenmap]()
>>> [ 351.964926] Hardware name: empty
>>> [ 351.964928] We get cfef0 instead of ffffffffffffffff!
>>
>> Is cfef0 part of the 1-1 mapping and in ACPI? On my box I see this:
>>
>> # dmesg | head -30 | grep bc55
>> [ 0.000000] 1-1 mapping on bc558->bc5ac
>> [ 0.000000] Xen: [mem 0x0000000040200000-0x00000000bc557fff] usable
>> [ 0.000000] Xen: [mem 0x00000000bc558000-0x00000000bc560fff] ACPI data
>>
>> So the E820 has it marked a ACPI data and sure enough I also see this:
>>
>> [ 0.000000] ACPI: DSDT 00000000bc558168 079E1 (v02 INTEL DQ67SW 00000016 INTL 20051117)
>>
>> Let me see what I get with the little module.
> > So: > [ 0.000000] 1-1 mapping on 9a->100 > [ 0.000000] 1-1 mapping on 20000->20200 > [ 0.000000] 1-1 mapping on 40000->40200 > [ 0.000000] 1-1 mapping on bc558->bc5ac > [ 0.000000] 1-1 mapping on bc5b4->bc8c5 > [ 0.000000] 1-1 mapping on bc8c6->bcb7c > [ 0.000000] 1-1 mapping on bcd00->100000 > >> dmesg | grep ACPI: | head > [ 0.000000] ACPI: RSDP 00000000000f0450 00024 (v02 INTEL) > [ 0.000000] ACPI: XSDT 00000000bc558070 00064 (v01 INTEL DQ67SW 01072009 AMI 00010013) > [ 0.000000] ACPI: FACP 00000000bc55fb50 000F4 (v04 INTEL DQ67SW 01072009 AMI 00010013) > [ 0.000000] ACPI: DSDT 00000000bc558168 079E1 (v02 INTEL DQ67SW 00000016 INTL 20051117) > [ 0.000000] ACPI: FACS 00000000bc8dbf80 00040 > [ 0.000000] ACPI: APIC 00000000bc55fc48 00072 (v03 INTEL DQ67SW 01072009 AMI 00010013) > [ 0.000000] ACPI: TCPA 00000000bc55fcc0 00032 (v02 INTEL DQ67SW 00000001 MSFT 01000013) > [ 0.000000] ACPI: SSDT 00000000bc55fcf8 00102 (v01 INTEL DQ67SW 00000001 MSFT 03000001) > [ 0.000000] ACPI: MCFG 00000000bc55fe00 0003C (v01 INTEL DQ67SW 01072009 MSFT 00000097) > [ 0.000000] ACPI: HPET 00000000bc55fe40 00038 (v01 INTEL DQ67SW 01072009 AMI. 00000004) > > 02:11:06 # 42 :~/ >> rmmod acpidump;insmod /acpidump.ko pfn=0xbc55e > > 02:11:15 # 43 :~/ >> rmmod acpidump;insmod /acpidump.ko pfn=0xbc559 > > 02:11:26 # 44 :~/ >> rmmod acpidump;insmod /acpidump.ko pfn=0xbc558 > insmod: error inserting ''/acpidump.ko'': -1 Invalid parameters > > 2:16:37 # 8 :/data/ >> insmod /acpidump.ko pfn=0xbc5ac > insmod: error inserting ''/acpidump.ko'': -1 Invalid parameters > > 02:16:45 # 10 :/data/ >> dmesg | grep p2m > [ 389.847683] raw p2m (bc558) gives us: ffffffffffffffff > [ 701.348502] raw p2m (bc5ac) gives us: ffffffffffffffff > > Huh? Looks like I can access the ACPI regions (bc559 had a bunch of stuff), > but _not_ on the boundary PFNs. > > Plot thickens - but sadly I won''t be able to do much until Thursday. > > I think the issue is somewhere in set_phys_range_identity. 
This > loop: > 767 for (pfn = pfn_s; pfn < pfn_e; pfn++) > 768 if (!__set_phys_to_machine(pfn, IDENTITY_FRAME(pfn))) > 769 break; > 770 > > Probably needs pfn <= pfn_e. But that still does not explain > why pfn_s is failing. > > Or maybe in the pfn_to_mfn machinary. It certainly has a lot of > overrides in it. If you were to instrument any of those to print > out more details on the offending PFNs that could help. > >-- Andre Przywara AMD-OSRC (Dresden) Tel: x29712
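
The probe results quoted above (bc558 and bc5ac both read back as ffffffffffffffff, bc559 works) can be checked against the "1-1 mapping on A->B" lines from the same dmesg. The following is only a sketch under one assumption: each printed range is half-open, [A, B), which matches the "pfn < pfn_e" loop in set_phys_range_identity. The table and the function name pfn_expected_identity are made up for illustration, and the ranges are the ones from Konrad's box, not Andre's.

```c
#include <stdbool.h>
#include <stddef.h>

/* One "1-1 mapping on start->end" line, read as the half-open range [start, end). */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/* Ranges copied from the dmesg output quoted in the thread. */
static const struct pfn_range identity_ranges[] = {
	{ 0x9a,    0x100    },
	{ 0x20000, 0x20200  },
	{ 0x40000, 0x40200  },
	{ 0xbc558, 0xbc5ac  },
	{ 0xbc5b4, 0xbc8c5  },
	{ 0xbc8c6, 0xbcb7c  },
	{ 0xbcd00, 0x100000 },
};

/* Returns true if the PFN falls inside one of the logged 1-1 ranges. */
static bool pfn_expected_identity(unsigned long pfn)
{
	size_t n = sizeof(identity_ranges) / sizeof(identity_ranges[0]);

	for (size_t i = 0; i < n; i++)
		if (pfn >= identity_ranges[i].start && pfn < identity_ranges[i].end)
			return true;
	return false;
}
```

Under that reading, bc5ac lies outside every range, so its failure is expected ("failed properly"), while bc558 is the first page of a range and should have been identity mapped. Only the failure on bc558 is the anomaly.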
On Thu, Jul 26, 2012 at 03:02:58PM +0200, Andre Przywara wrote:
> On 06/30/2012 04:19 AM, Konrad Rzeszutek Wilk wrote:
>
> Konrad, David,
>
> back on track for this issue. Thanks for your input, I could do some
> more debugging (see below for a refresh):
>
> It seems like it affects only the first page of the 1:1 mapping. I
> didn't have an issues with the last PFN or the page behind it (which
> failed properly).
>
> David, thanks for the hint with varying dom0_mem parameter. I
> thought I already checked this, but I did it once again and it
> turned out that it is only an issue if dom0_mem is smaller than the
> ACPI area, which generates a hole in the memory map. So we have
> (simplified)
> * 1:1 mapping to 1 MB
> * normal mapping till dom0_mem
> * unmapped area till ACPI E820 area
> * ACPI E820 1:1 mapping
>
> As far as I could chase it down the 1:1 mapping itself looks OK, I
> couldn't find any off-by-one bugs here. So maybe it is code that
> later on invalidates areas between the normal guest mapping and the
> ACPI mem?

I think I found it. Can you try this pls [and if you can't find
early_to_phys.. just use the __set_phys_to call]

From ab915d98f321b0fcca1932747c632b5f0f299f55 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 17 Aug 2012 16:43:28 -0400
Subject: [PATCH] xen/setup: Fix one-off error when adding for-balloon PFNs to
 the P2M.

When we are finished with return PFNs to the hypervisor, then
populate it back, and also mark the E820 MMIO and E820 gaps
as IDENTITY_FRAMEs, we then call P2M to set areas that can
be used for ballooning. We were off by one, and ended up
over-writting a P2M entry that most likely was an IDENTITY_FRAME.
For example:

1-1 mapping on 40000->40200
1-1 mapping on bc558->bc5ac
1-1 mapping on bc5b4->bc8c5
1-1 mapping on bc8c6->bcb7c
1-1 mapping on bcd00->100000
Released 614 pages of unused memory
Set 277889 page(s) to 1-1 mapping
Populating 40200-40466 pfn range: 614 pages added

=> here we set from 40466 up to bc559 P2M tree to be
INVALID_P2M_ENTRY. We should have done it up to bc558.

The end result is that if anybody is trying to construct
a PTE for PFN bc558 they end up with ~PAGE_PRESENT.

CC: stable@vger.kernel.org
Reported-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/setup.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index ead8557..030a55a 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -78,9 +78,16 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
 	memblock_reserve(start, size);
 
 	xen_max_p2m_pfn = PFN_DOWN(start + size);
+	for (pfn = PFN_DOWN(start); pfn < xen_max_p2m_pfn; pfn++) {
+		unsigned long mfn = pfn_to_mfn(pfn);
+
+		if (WARN(mfn == pfn, "Trying to over-write 1-1 mapping (pfn: %lx)\n", pfn))
+			continue;
+		WARN(mfn != INVALID_P2M_ENTRY, "Trying to remove %lx which has %lx mfn!\n",
+			pfn, mfn);
 
-	for (pfn = PFN_DOWN(start); pfn <= xen_max_p2m_pfn; pfn++)
-		__set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
+		early_set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
+	}
 }
 
 static unsigned long __init xen_do_chunk(unsigned long start,
-- 
1.7.7.6
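
The effect of the "<=" versus "<" bound in the patch can be reproduced outside the kernel with a toy table. This is a minimal sketch, not kernel code: the array stands in for the P2M tree, "entry == pfn" stands in for an IDENTITY_FRAME, and every toy_* name as well as the 16-entry size is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_PFNS 16
#define TOY_INVALID UINT64_MAX	/* stand-in for INVALID_P2M_ENTRY */

/* Invalidate [start, end]: inclusive end, like the old "<=" loop. */
static void toy_invalidate_inclusive(uint64_t *p2m, size_t start, size_t end)
{
	for (size_t pfn = start; pfn <= end; pfn++)
		p2m[pfn] = TOY_INVALID;
}

/* Invalidate [start, end): exclusive end, like the fixed "<" loop. */
static void toy_invalidate_exclusive(uint64_t *p2m, size_t start, size_t end)
{
	for (size_t pfn = start; pfn < end; pfn++)
		p2m[pfn] = TOY_INVALID;
}

/*
 * Build a toy table where PFNs 8..11 are a 1-1 region (entry == pfn) and
 * everything else is invalid, then invalidate the "balloon" range that
 * ends exactly where the 1-1 region starts.
 * Returns 1 if the first 1-1 page (PFN 8) is still identity mapped.
 */
static int toy_first_identity_survives(int use_inclusive_loop)
{
	uint64_t p2m[TOY_NR_PFNS];
	const size_t balloon_start = 4, identity_start = 8, identity_end = 12;

	for (size_t pfn = 0; pfn < TOY_NR_PFNS; pfn++)
		p2m[pfn] = TOY_INVALID;
	for (size_t pfn = identity_start; pfn < identity_end; pfn++)
		p2m[pfn] = pfn;

	if (use_inclusive_loop)
		toy_invalidate_inclusive(p2m, balloon_start, identity_start);
	else
		toy_invalidate_exclusive(p2m, balloon_start, identity_start);

	return p2m[identity_start] == identity_start;
}
```

With the inclusive loop the entry for PFN 8 is wiped while 9..11 stay intact, which mirrors the report that only the first page of the 1:1 region (bc558) was affected. With the exclusive loop the region is untouched.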
On 08/17/2012 10:52 PM, Konrad Rzeszutek Wilk wrote:
> On Thu, Jul 26, 2012 at 03:02:58PM +0200, Andre Przywara wrote:
>> On 06/30/2012 04:19 AM, Konrad Rzeszutek Wilk wrote:
>>
>> Konrad, David,
>>
>> back on track for this issue. Thanks for your input, I could do some
>> more debugging (see below for a refresh):
>>
>> It seems like it affects only the first page of the 1:1 mapping. I
>> didn't have an issues with the last PFN or the page behind it (which
>> failed properly).
>>
>> David, thanks for the hint with varying dom0_mem parameter. I
>> thought I already checked this, but I did it once again and it
>> turned out that it is only an issue if dom0_mem is smaller than the
>> ACPI area, which generates a hole in the memory map. So we have
>> (simplified)
>> * 1:1 mapping to 1 MB
>> * normal mapping till dom0_mem
>> * unmapped area till ACPI E820 area
>> * ACPI E820 1:1 mapping
>>
>> As far as I could chase it down the 1:1 mapping itself looks OK, I
>> couldn't find any off-by-one bugs here. So maybe it is code that
>> later on invalidates areas between the normal guest mapping and the
>> ACPI mem?
>
> I think I found it. Can you try this pls [and if you can't find
> early_to_phys.. just use the __set_phys_to call]

Yes, that works. At least after a quick test on my test box. Both the
test module and acpidump work as expected. If I replace the "<" in your
patch with the original "<=", I get the warning (and due to the
"continue" it also works).
I also successfully tested the minimal fix (just replacing <= with <).
I will feed it to the testers here to cover more machines.

Do you want to keep the warnings in (which exceed 80 characters, btw)?

Thanks a lot and:

Tested-by: Andre Przywara <andre.przywara@amd.com>

Regards,
Andre.

>
> From ab915d98f321b0fcca1932747c632b5f0f299f55 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Fri, 17 Aug 2012 16:43:28 -0400
> Subject: [PATCH] xen/setup: Fix one-off error when adding for-balloon PFNs to
>  the P2M.
>
> When we are finished with return PFNs to the hypervisor, then
> populate it back, and also mark the E820 MMIO and E820 gaps
> as IDENTITY_FRAMEs, we then call P2M to set areas that can
> be used for ballooning. We were off by one, and ended up
> over-writting a P2M entry that most likely was an IDENTITY_FRAME.
> For example:
>
> 1-1 mapping on 40000->40200
> 1-1 mapping on bc558->bc5ac
> 1-1 mapping on bc5b4->bc8c5
> 1-1 mapping on bc8c6->bcb7c
> 1-1 mapping on bcd00->100000
> Released 614 pages of unused memory
> Set 277889 page(s) to 1-1 mapping
> Populating 40200-40466 pfn range: 614 pages added
>
> => here we set from 40466 up to bc559 P2M tree to be
> INVALID_P2M_ENTRY. We should have done it up to bc558.
>
> The end result is that if anybody is trying to construct
> a PTE for PFN bc558 they end up with ~PAGE_PRESENT.
>
> CC: stable@vger.kernel.org
> Reported-by: Andre Przywara <andre.przywara@amd.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/xen/setup.c |   11 +++++++++--
>  1 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index ead8557..030a55a 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -78,9 +78,16 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
>  	memblock_reserve(start, size);
>
>  	xen_max_p2m_pfn = PFN_DOWN(start + size);
> +	for (pfn = PFN_DOWN(start); pfn < xen_max_p2m_pfn; pfn++) {
> +		unsigned long mfn = pfn_to_mfn(pfn);
> +
> +		if (WARN(mfn == pfn, "Trying to over-write 1-1 mapping (pfn: %lx)\n", pfn))
> +			continue;
> +		WARN(mfn != INVALID_P2M_ENTRY, "Trying to remove %lx which has %lx mfn!\n",
> +			pfn, mfn);
>
> -	for (pfn = PFN_DOWN(start); pfn <= xen_max_p2m_pfn; pfn++)
> -		__set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
> +		early_set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
> +	}
>  }
>
>  static unsigned long __init xen_do_chunk(unsigned long start,
>

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
On 23/08/12 11:14, Andre Przywara wrote:
> On 08/17/2012 10:52 PM, Konrad Rzeszutek Wilk wrote:
>> On Thu, Jul 26, 2012 at 03:02:58PM +0200, Andre Przywara wrote:
>>> On 06/30/2012 04:19 AM, Konrad Rzeszutek Wilk wrote:
>>>
>>> Konrad, David,
>>>
>>> back on track for this issue. Thanks for your input, I could do some
>>> more debugging (see below for a refresh):
>>>
>>> It seems like it affects only the first page of the 1:1 mapping. I
>>> didn't have an issues with the last PFN or the page behind it (which
>>> failed properly).
>>>
>>> David, thanks for the hint with varying dom0_mem parameter. I
>>> thought I already checked this, but I did it once again and it
>>> turned out that it is only an issue if dom0_mem is smaller than the
>>> ACPI area, which generates a hole in the memory map. So we have
>>> (simplified)
>>> * 1:1 mapping to 1 MB
>>> * normal mapping till dom0_mem
>>> * unmapped area till ACPI E820 area
>>> * ACPI E820 1:1 mapping
>>>
>>> As far as I could chase it down the 1:1 mapping itself looks OK, I
>>> couldn't find any off-by-one bugs here. So maybe it is code that
>>> later on invalidates areas between the normal guest mapping and the
>>> ACPI mem?
>>
>> I think I found it. Can you try this pls [and if you can't find
>> early_to_phys.. just use the __set_phys_to call]
>
> Yes, that works. At least after a quick test on my test box. Both the
> test module and acpidump work as expected. If I replace the "<" in your
> patch with the original "<=", I get the warning (and due to the
> "continue" it also works).

Note that the balloon driver could subsequently overwrite the p2m entry.

I don't think it is worth redoing the patch to adjust the region passed
to the balloon driver to avoid this though.

> I also successfully tested the minimal fix (just replacing <= with <).
> I will feed it to the testers here to cover more machines.
>
> Do you want to keep the warnings in (which exceed 80 characters, btw)?

I think we do.

David
On Thu, Aug 23, 2012 at 12:14:56PM +0200, Andre Przywara wrote:
> On 08/17/2012 10:52 PM, Konrad Rzeszutek Wilk wrote:
> >On Thu, Jul 26, 2012 at 03:02:58PM +0200, Andre Przywara wrote:
> >>On 06/30/2012 04:19 AM, Konrad Rzeszutek Wilk wrote:
> >>
> >>Konrad, David,
> >>
> >>back on track for this issue. Thanks for your input, I could do some
> >>more debugging (see below for a refresh):
> >>
> >>It seems like it affects only the first page of the 1:1 mapping. I
> >>didn't have an issues with the last PFN or the page behind it (which
> >>failed properly).
> >>
> >>David, thanks for the hint with varying dom0_mem parameter. I
> >>thought I already checked this, but I did it once again and it
> >>turned out that it is only an issue if dom0_mem is smaller than the
> >>ACPI area, which generates a hole in the memory map. So we have
> >>(simplified)
> >>* 1:1 mapping to 1 MB
> >>* normal mapping till dom0_mem
> >>* unmapped area till ACPI E820 area
> >>* ACPI E820 1:1 mapping
> >>
> >>As far as I could chase it down the 1:1 mapping itself looks OK, I
> >>couldn't find any off-by-one bugs here. So maybe it is code that
> >>later on invalidates areas between the normal guest mapping and the
> >>ACPI mem?
> >
> >I think I found it. Can you try this pls [and if you can't find
> >early_to_phys.. just use the __set_phys_to call]
>
> Yes, that works. At least after a quick test on my test box. Both
> the test module and acpidump work as expected. If I replace the "<"
> in your patch with the original "<=", I get the warning (and due to
> the "continue" it also works).
> I also successfully tested the minimal fix (just replacing <= with <).
> I will feed it to the testers here to cover more machines.
>
> Do you want to keep the warnings in (which exceed 80 characters, btw)?

Yes. The new style is to allow any type of printk/WARN etc to be
unbroken and break the 80 characters.

>
> Thanks a lot and:
>
> Tested-by: Andre Przywara <andre.przywara@amd.com>

Great. Thx.

>
> Regards,
> Andre.
>
> >
> > From ab915d98f321b0fcca1932747c632b5f0f299f55 Mon Sep 17 00:00:00 2001
> >From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >Date: Fri, 17 Aug 2012 16:43:28 -0400
> >Subject: [PATCH] xen/setup: Fix one-off error when adding for-balloon PFNs to
> > the P2M.
> >
> >When we are finished with return PFNs to the hypervisor, then
> >populate it back, and also mark the E820 MMIO and E820 gaps
> >as IDENTITY_FRAMEs, we then call P2M to set areas that can
> >be used for ballooning. We were off by one, and ended up
> >over-writting a P2M entry that most likely was an IDENTITY_FRAME.
> >For example:
> >
> >1-1 mapping on 40000->40200
> >1-1 mapping on bc558->bc5ac
> >1-1 mapping on bc5b4->bc8c5
> >1-1 mapping on bc8c6->bcb7c
> >1-1 mapping on bcd00->100000
> >Released 614 pages of unused memory
> >Set 277889 page(s) to 1-1 mapping
> >Populating 40200-40466 pfn range: 614 pages added
> >
> >=> here we set from 40466 up to bc559 P2M tree to be
> >INVALID_P2M_ENTRY. We should have done it up to bc558.
> >
> >The end result is that if anybody is trying to construct
> >a PTE for PFN bc558 they end up with ~PAGE_PRESENT.
> >
> >CC: stable@vger.kernel.org
> >Reported-by: Andre Przywara <andre.przywara@amd.com>
> >Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >---
> > arch/x86/xen/setup.c |   11 +++++++++--
> > 1 files changed, 9 insertions(+), 2 deletions(-)
> >
> >diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> >index ead8557..030a55a 100644
> >--- a/arch/x86/xen/setup.c
> >+++ b/arch/x86/xen/setup.c
> >@@ -78,9 +78,16 @@ static void __init xen_add_extra_mem(u64 start, u64 size)
> > 	memblock_reserve(start, size);
> >
> > 	xen_max_p2m_pfn = PFN_DOWN(start + size);
> >+	for (pfn = PFN_DOWN(start); pfn < xen_max_p2m_pfn; pfn++) {
> >+		unsigned long mfn = pfn_to_mfn(pfn);
> >+
> >+		if (WARN(mfn == pfn, "Trying to over-write 1-1 mapping (pfn: %lx)\n", pfn))
> >+			continue;
> >+		WARN(mfn != INVALID_P2M_ENTRY, "Trying to remove %lx which has %lx mfn!\n",
> >+			pfn, mfn);
> >
> >-	for (pfn = PFN_DOWN(start); pfn <= xen_max_p2m_pfn; pfn++)
> >-		__set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
> >+		early_set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
> >+	}
> > }
> >
> > static unsigned long __init xen_do_chunk(unsigned long start,
> >
>
> --
> Andre Przywara
> AMD-Operating System Research Center (OSRC), Dresden, Germany
On Thu, Aug 23, 2012 at 11:22:29AM +0100, David Vrabel wrote:
> On 23/08/12 11:14, Andre Przywara wrote:
> > On 08/17/2012 10:52 PM, Konrad Rzeszutek Wilk wrote:
> >> On Thu, Jul 26, 2012 at 03:02:58PM +0200, Andre Przywara wrote:
> >>> On 06/30/2012 04:19 AM, Konrad Rzeszutek Wilk wrote:
> >>>
> >>> Konrad, David,
> >>>
> >>> back on track for this issue. Thanks for your input, I could do some
> >>> more debugging (see below for a refresh):
> >>>
> >>> It seems like it affects only the first page of the 1:1 mapping. I
> >>> didn't have an issues with the last PFN or the page behind it (which
> >>> failed properly).
> >>>
> >>> David, thanks for the hint with varying dom0_mem parameter. I
> >>> thought I already checked this, but I did it once again and it
> >>> turned out that it is only an issue if dom0_mem is smaller than the
> >>> ACPI area, which generates a hole in the memory map. So we have
> >>> (simplified)
> >>> * 1:1 mapping to 1 MB
> >>> * normal mapping till dom0_mem
> >>> * unmapped area till ACPI E820 area
> >>> * ACPI E820 1:1 mapping
> >>>
> >>> As far as I could chase it down the 1:1 mapping itself looks OK, I
> >>> couldn't find any off-by-one bugs here. So maybe it is code that
> >>> later on invalidates areas between the normal guest mapping and the
> >>> ACPI mem?
> >>
> >> I think I found it. Can you try this pls [and if you can't find
> >> early_to_phys.. just use the __set_phys_to call]
> >
> > Yes, that works. At least after a quick test on my test box. Both the
> > test module and acpidump work as expected. If I replace the "<" in your
> > patch with the original "<=", I get the warning (and due to the
> > "continue" it also works).
>
> Note that the balloon driver could subsequently overwrite the p2m entry.

Hmm, I am not seeing how.. the region that is passed in is right up to
the PFN (I believe). And I did run with this patch over a couple of days
with ballooning up and down. But maybe I missed something?

Let me prep a patch that adds some more checks in the balloon driver
just in case we do hit this.

> I don't think it is worth redoing the patch to adjust the region passed
> to the balloon driver to avoid this though.
>
> > I also successfully tested the minimal fix (just replacing <= with <).
> > I will feed it to the testers here to cover more machines.
> >
> > Do you want to keep the warnings in (which exceed 80 characters, btw)?
>
> I think we do.
>
> David
On Thu, Aug 23, 2012 at 03:36:31PM +0100, David Vrabel wrote:
> On 23/08/12 15:10, Konrad Rzeszutek Wilk wrote:
> > On Thu, Aug 23, 2012 at 11:22:29AM +0100, David Vrabel wrote:
> >> On 23/08/12 11:14, Andre Przywara wrote:
> >>> On 08/17/2012 10:52 PM, Konrad Rzeszutek Wilk wrote:
> >>>> On Thu, Jul 26, 2012 at 03:02:58PM +0200, Andre Przywara wrote:
> >>>>> On 06/30/2012 04:19 AM, Konrad Rzeszutek Wilk wrote:
> >>>>>
> >>>>> Konrad, David,
> >>>>>
> >>>>> back on track for this issue. Thanks for your input, I could do some
> >>>>> more debugging (see below for a refresh):
> >>>>>
> >>>>> It seems like it affects only the first page of the 1:1 mapping. I
> >>>>> didn't have an issues with the last PFN or the page behind it (which
> >>>>> failed properly).
> >>>>>
> >>>>> David, thanks for the hint with varying dom0_mem parameter. I
> >>>>> thought I already checked this, but I did it once again and it
> >>>>> turned out that it is only an issue if dom0_mem is smaller than the
> >>>>> ACPI area, which generates a hole in the memory map. So we have
> >>>>> (simplified)
> >>>>> * 1:1 mapping to 1 MB
> >>>>> * normal mapping till dom0_mem
> >>>>> * unmapped area till ACPI E820 area
> >>>>> * ACPI E820 1:1 mapping
> >>>>>
> >>>>> As far as I could chase it down the 1:1 mapping itself looks OK, I
> >>>>> couldn't find any off-by-one bugs here. So maybe it is code that
> >>>>> later on invalidates areas between the normal guest mapping and the
> >>>>> ACPI mem?
> >>>>
> >>>> I think I found it. Can you try this pls [and if you can't find
> >>>> early_to_phys.. just use the __set_phys_to call]
> >>>
> >>> Yes, that works. At least after a quick test on my test box. Both the
> >>> test module and acpidump work as expected. If I replace the "<" in your
> >>> patch with the original "<=", I get the warning (and due to the
> >>> "continue" it also works).
> >>
> >> Note that the balloon driver could subsequently overwrite the p2m entry.
> >
> > Hmm, I am not seeing how.. the region that is passed in is right up to
> > the PFN (I believe). And I did run with this patch over a couple of days
> > with ballooning up and down. But maybe I missed something?
>
> Hrrm. I was sure I wrote "Note that the balloon driver could
> subsequently overwrite the p2m entry /if/ this warning is triggered."
> but it seems I did not. :/
>
> i.e., if the warning is triggered, the xen_extra_mem region will be
> incorrectly sized and the balloon driver will make use of the incorrect
> region.

Ah, that makes more sense. Yes we would do the overwriting part later on
in that scenario.. which makes me wonder - if we did that in the past how
come MMIO devices still worked! Some boxes have the gap/MMIO right at the
edge of the E820_RAM - perhaps they were silently coping with it and we
just never caught on to this fact.

>
> David
>
> > Let me prep a patch that adds some more checks in the balloon driver
> > just in case we do hit this.
> >
> >> I don't think it is worth redoing the patch to adjust the region passed
> >> to the balloon driver to avoid this though.
> >>
> >>> I also successfully tested the minimal fix (just replacing <= with <).
> >>> I will feed it to the testers here to cover more machines.
> >>>
> >>> Do you want to keep the warnings in (which exceed 80 characters, btw)?
> >>
> >> I think we do.
> >>
> >> David
On 23/08/12 15:10, Konrad Rzeszutek Wilk wrote:
> On Thu, Aug 23, 2012 at 11:22:29AM +0100, David Vrabel wrote:
>> On 23/08/12 11:14, Andre Przywara wrote:
>>> On 08/17/2012 10:52 PM, Konrad Rzeszutek Wilk wrote:
>>>> On Thu, Jul 26, 2012 at 03:02:58PM +0200, Andre Przywara wrote:
>>>>> On 06/30/2012 04:19 AM, Konrad Rzeszutek Wilk wrote:
>>>>>
>>>>> Konrad, David,
>>>>>
>>>>> back on track for this issue. Thanks for your input, I could do some
>>>>> more debugging (see below for a refresh):
>>>>>
>>>>> It seems like it affects only the first page of the 1:1 mapping. I
>>>>> didn't have an issues with the last PFN or the page behind it (which
>>>>> failed properly).
>>>>>
>>>>> David, thanks for the hint with varying dom0_mem parameter. I
>>>>> thought I already checked this, but I did it once again and it
>>>>> turned out that it is only an issue if dom0_mem is smaller than the
>>>>> ACPI area, which generates a hole in the memory map. So we have
>>>>> (simplified)
>>>>> * 1:1 mapping to 1 MB
>>>>> * normal mapping till dom0_mem
>>>>> * unmapped area till ACPI E820 area
>>>>> * ACPI E820 1:1 mapping
>>>>>
>>>>> As far as I could chase it down the 1:1 mapping itself looks OK, I
>>>>> couldn't find any off-by-one bugs here. So maybe it is code that
>>>>> later on invalidates areas between the normal guest mapping and the
>>>>> ACPI mem?
>>>>
>>>> I think I found it. Can you try this pls [and if you can't find
>>>> early_to_phys.. just use the __set_phys_to call]
>>>
>>> Yes, that works. At least after a quick test on my test box. Both the
>>> test module and acpidump work as expected. If I replace the "<" in your
>>> patch with the original "<=", I get the warning (and due to the
>>> "continue" it also works).
>>
>> Note that the balloon driver could subsequently overwrite the p2m entry.
>
> Hmm, I am not seeing how.. the region that is passed in is right up to
> the PFN (I believe). And I did run with this patch over a couple of days
> with ballooning up and down. But maybe I missed something?

Hrrm. I was sure I wrote "Note that the balloon driver could
subsequently overwrite the p2m entry /if/ this warning is triggered."
but it seems I did not. :/

i.e., if the warning is triggered, the xen_extra_mem region will be
incorrectly sized and the balloon driver will make use of the incorrect
region.

David

> Let me prep a patch that adds some more checks in the balloon driver
> just in case we do hit this.
>
>> I don't think it is worth redoing the patch to adjust the region passed
>> to the balloon driver to avoid this though.
>>
>>> I also successfully tested the minimal fix (just replacing <= with <).
>>> I will feed it to the testers here to cover more machines.
>>>
>>> Do you want to keep the warnings in (which exceed 80 characters, btw)?
>>
>> I think we do.
>>
>> David
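
David's point is that skipping a 1-1 entry with "continue" protects the P2M at setup time, but the region handed on for ballooning still covers that PFN. One way to picture "adjusting the region" is to cut it at the first identity entry. The helper below is purely hypothetical: it is not what the patch or the balloon driver does, it works on a toy array rather than the real P2M tree, and "entry == pfn" again stands in for an IDENTITY_FRAME.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: scan the half-open range [start, end) of a toy P2M
 * table and return the first PFN whose entry is an identity mapping
 * (entry == pfn). If there is none, return end. A caller could use the
 * result as the new exclusive end of the region it hands on.
 */
static size_t toy_trim_at_identity(const uint64_t *p2m, size_t start, size_t end)
{
	for (size_t pfn = start; pfn < end; pfn++)
		if (p2m[pfn] == pfn)
			return pfn;
	return end;
}
```

For a table where PFNs 4 and 5 are identity mapped, a region [1, 6) would be trimmed to [1, 4), so later users of the region never see the 1-1 pages.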