Scott Garron
2011-Apr-26 00:04 UTC
[Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
It appears that a bug that I had reported back in August
( http://lists.xensource.com/archives/html/xen-devel/2010-08/msg01798.html )
seems to have been fixed in a recent update
( http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00742.html )
... at least it looks like the same bug to me. I've been trying, over
the past few days, to move to a newer version with the fix applied
on a test machine and, once I get it working, would like to update my
production server. It would be nice to be able to get back to doing
backups again. :)

I don't know whether or not the fix for that bug worked, though,
because I can't seem to get any combination of a newer hypervisor and
Jeremy's xen.git xen/stable-2.6.32.x branch to boot at all. I can get
the kernel to boot without Xen, but not with it.

Here is a snippet of serial output near the crash:

[    0.310132] xen_balloon: Initialising balloon driver with page order 0.
[    0.313482] last_pfn = 0x1d9ff0 max_arch_pfn = 0x400000000
[    0.316665] BUG: unable to handle kernel paging request at ffffea0003800028
[    0.316665] IP: [<ffffffff819a8aea>] balloon_init+0x20b/0x25e
[    0.316665] PGD c402067 PUD c403067 PMD 0
[    0.316665] Oops: 0002 [#1] SMP

Full serial output, lspci -vvv, .config, dmidecode, and grub2 conf can
be retrieved here:

http://www.pridelands.org/~simba/xen-debug/

git log:

commit 057b171345d94c785d7e8bb21302c5d1a23ea048
Merge: 6dfaa17 145fff1
Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Date:   Fri Apr 22 13:46:05 2011 -0700

git branch:
  xen/next-2.6.32
* xen/stable-2.6.32.x

hg log:
changeset:   23246:eb4505f8dd97
tag:         tip
user:        Tim Deegan <Tim.Deegan@citrix.com>
date:        Wed Apr 20 12:02:51 2011 +0100
summary:     xen/x86: re-enable xsave by default now that it supports live migration.

I've tried booting with no dom0_mem option just to see if that made
any difference. I also tried using the stable tarball of xen-4.1.0, the
current xen-4.1-testing.hg, as well as xen-unstable.hg. All yield
exactly the same results. Any ideas?

-- 
Scott Garron

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Apr-26 03:15 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Mon, Apr 25, 2011 at 08:04:20PM -0400, Scott Garron wrote:
> It appears that a bug that I had reported back in August
> (
> http://lists.xensource.com/archives/html/xen-devel/2010-08/msg01798.html
> ) seems to have been fixed in a recent update (
> http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00742.html
> ) ... at least it looks like the same bug to me. I've been trying, over

A bit different.

> the past few days, to move to using a newer version with the fix applied
> on a test machine and once I get it working, would like to update my
> production server. It would be nice to be able to get back to doing
> backups again. :)
>
> I don't know whether or not the fix for that bug worked, though,
> because I can't seem to get any combination of a newer hypervisor and
> Jeremy's xen.git xen/stable-2.6.32.x branch to boot at all. I can get
> the kernel to boot without Xen, but not with it.
>
> Here a snippet of serial output near the crash:
>
> [    0.310132] xen_balloon: Initialising balloon driver with page order 0.
> [    0.313482] last_pfn = 0x1d9ff0 max_arch_pfn = 0x400000000
> [    0.316665] BUG: unable to handle kernel paging request at
> ffffea0003800028
> [    0.316665] IP: [<ffffffff819a8aea>] balloon_init+0x20b/0x25e

OK. That looks different.

> [    0.316665] PGD c402067 PUD c403067 PMD 0
> [    0.316665] Oops: 0002 [#1] SMP
>
> Full serial output, lspci -vvv, .config, dmidecode, and grub2 conf can
> be retrieved here:
>
> http://www.pridelands.org/~simba/xen-debug/

Excellent.

>
> git log:
>
> commit 057b171345d94c785d7e8bb21302c5d1a23ea048
> Merge: 6dfaa17 145fff1
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> Date:   Fri Apr 22 13:46:05 2011 -0700
>
> git branch:
>   xen/next-2.6.32
> * xen/stable-2.6.32.x
>
>
> hg log:
> changeset:   23246:eb4505f8dd97
> tag:         tip
> user:        Tim Deegan <Tim.Deegan@citrix.com>
> date:        Wed Apr 20 12:02:51 2011 +0100
> summary:     xen/x86: re-enable xsave by default now that it supports
> live migration.
>
> I've tried booting with no dom0_mem option just to see if that made
> any difference. I also tried using the stable tarball of xen-4.1.0, the
> current xen-4.1-testing.hg, as well as xen-unstable.hg. All yield
> exactly the same results. Any ideas?

Can you run gdb on the vmlinux file and find out what instructions and
what code caused the failure? The offending piece of code is:

[<ffffffff819a88df>] ? balloon_init+0x0/0x25e

so if you launch gdb:

x/20i 0xffffffff819a88df

should give an idea what instruction is causing it. And if you have
compiled it with CONFIG_DEBUG_INFO (or CONFIG_INFO_DEBUG perhaps?) it
should tell you what C code it hit.
Scott Garron
2011-Apr-26 05:03 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 4/25/2011 11:15 PM, Konrad Rzeszutek Wilk wrote:
> so if you launch gdb:
>
> x/20i 0xffffffff819a88df
>
> should give an idea what instruction is causing it.

Reading symbols from /usr/src/linux-2.6-xen/vmlinux...done.
(gdb) x/20i 0xffffffff819a88df
0xffffffff819a88df <balloon_init>:      push   %rbp
0xffffffff819a88e0 <balloon_init+1>:    mov    $0xffffffed,%eax
0xffffffff819a88e5 <balloon_init+6>:    mov    %rsp,%rbp
0xffffffff819a88e8 <balloon_init+9>:    push   %rbx
0xffffffff819a88e9 <balloon_init+10>:   sub    $0x8,%rsp
0xffffffff819a88ed <balloon_init+14>:   cmpl   $0x0,0x57a0c(%rip)        # 0xffffffff81a00300
0xffffffff819a88f4 <balloon_init+21>:   je     0xffffffff819a8b39 <balloon_init+602>
0xffffffff819a88fa <balloon_init+27>:   mov    0x11a8a0(%rip),%esi        # 0xffffffff81ac31a0
0xffffffff819a8900 <balloon_init+33>:   xor    %eax,%eax
0xffffffff819a8902 <balloon_init+35>:   mov    $0xffffffff817ea59c,%rdi
0xffffffff819a8909 <balloon_init+42>:   callq  0xffffffff815c229e <printk>
0xffffffff819a890e <balloon_init+47>:   mov    0x11a88c(%rip),%ecx        # 0xffffffff81ac31a0
0xffffffff819a8914 <balloon_init+53>:   mov    $0x1,%eax
0xffffffff819a8919 <balloon_init+58>:   shl    %cl,%eax
0xffffffff819a891b <balloon_init+60>:   cmpl   $0x1,0x579de(%rip)        # 0xffffffff81a00300
0xffffffff819a8922 <balloon_init+67>:   cltq
0xffffffff819a8924 <balloon_init+69>:   mov    %rax,0x11a87d(%rip)        # 0xffffffff81ac31a8
0xffffffff819a892b <balloon_init+76>:   jne    0xffffffff819a893a <balloon_init+91>
0xffffffff819a892d <balloon_init+78>:   mov    0x579d4(%rip),%rax        # 0xffffffff81a00308
0xffffffff819a8934 <balloon_init+85>:   mov    0x20(%rax),%rax

-- 
Scott Garron
Konrad Rzeszutek Wilk
2011-Apr-27 20:09 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Tue, Apr 26, 2011 at 01:03:38AM -0400, Scott Garron wrote:
> On 4/25/2011 11:15 PM, Konrad Rzeszutek Wilk wrote:
> >so if you launch gdb:
> >
> > x/20i 0xffffffff819a88df

Duh! I meant this one:

[    0.316665] RIP: e030:[<ffffffff819a8aea>] [<ffffffff819a8aea>] balloon_init+0x20b/0x25e

Sorry about that. Can you also run your kernel with 'initcall_debug
loglevel=8' please?
Scott Garron
2011-Apr-27 23:45 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 04/27/2011 04:09 PM, Konrad Rzeszutek Wilk wrote:
> Duh! I meant this one:
>
> [    0.316665] RIP: e030:[<ffffffff819a8aea>] [<ffffffff819a8aea>]
> balloon_init+0x20b/0x25e
>
> Sorry about that. Can you also run your kernel with 'initcall_debug
> loglevel=8' please?

Ok, I've put what I came up with here:

http://www.pridelands.org/~simba/xen-debug/debugnotes.txt

I also added a few pr_info() lines around the offending code to try
to get more of a handle on how far it is getting and what it's working
on at the time of failure:

********

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index a065fda..b5f0650 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -488,10 +488,13 @@ static int __init balloon_init(void)
         */
        extra_pfn_end = min(min(max_pfn, e820_end_of_ram_pfn()),
                            (unsigned long)PFN_DOWN(xen_extra_mem_start + xen_ex
+       pr_info("extra_pfn_end: 0x%x", extra_pfn_end);  /* debug */
        for (pfn = PFN_UP(xen_extra_mem_start);
             pfn < extra_pfn_end;
             pfn += balloon_npages) {
+               pr_info("pfn: 0x%x", pfn);  /* debug */
                page = pfn_to_page(pfn);
+               pr_info("page: 0x%p", page);  /* debug */
                /* totalram_pages doesn't include the boot-time
                   balloon extension, so don't subtract from it. */
                __balloon_append(page);

********

The new serial console output, with "initcall_debug loglevel=8" and
the pr_info() additions to the code, can be found here:

http://www.pridelands.org/~simba/xen-debug/hailstorm-fullserial20110427.txt

... but I'll paste the part closest to the crash here for your convenience:

[    1.016663] calling  balloon_init+0x0/0x280 @ 1
[    1.016663] xen_balloon: Initialising balloon driver with page order 0.
[    1.033446] last_pfn = 0x1d9ff0 max_arch_pfn = 0x400000000
[    1.036663] extra_pfn_end: 0x1d9ff0
[    1.036663] pfn: 0x100000
[    1.036663] page: 0xffffea0003800000
[    1.036663] BUG: unable to handle kernel paging request at ffffea0003800028
[    1.036663] IP: [<ffffffff819a8b1f>] balloon_init+0x240/0x280
[    1.036663] PGD 18402067 PUD 18403067 PMD 0

So the crash is happening within the first iteration of that for()
loop, presumably while calling __balloon_append(page). That's as far as
I dove into it so far, but I figured I'd give you an update as to what
I've found and tried.

Just for more information's sake, I also tried booting this kernel as
a paravirt domU under the Debian Stable 2.6.32-5-xen-amd64 stock kernel
and Xen 4.1.0. It booted without incident (aside from a ridiculously
long spew of printk's from my additions to that for() loop), so the
failure is specific to the kernel booting as a dom0. That probably
doesn't narrow down much, but I figured it was noteworthy.

-- 
Scott Garron
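The faulting address and the "PMD 0" in the oops above can be cross-checked by splitting the address into its page-table indices. What follows is a minimal userspace sketch, assuming standard x86-64 4-level paging with 4 KiB pages (nine index bits per level, twelve offset bits). The names pt_index, is_canonical and split_vaddr are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: one index per level of a 4-level page table. */
struct pt_index {
    unsigned pgd;    /* bits 47..39 of the virtual address */
    unsigned pud;    /* bits 38..30 */
    unsigned pmd;    /* bits 29..21 */
    unsigned pte;    /* bits 20..12 */
    unsigned offset; /* bits 11..0, byte offset inside the 4 KiB page */
};

/* A 48-bit canonical address has bits 63..47 all zero or all one. */
static int is_canonical(uint64_t va)
{
    uint64_t top = va >> 47;

    return top == 0 || top == 0x1ffff;
}

/* Split a canonical virtual address into its four table indices. */
static struct pt_index split_vaddr(uint64_t va)
{
    struct pt_index i;

    assert(is_canonical(va));
    i.pgd = (unsigned)((va >> 39) & 0x1ff);
    i.pud = (unsigned)((va >> 30) & 0x1ff);
    i.pmd = (unsigned)((va >> 21) & 0x1ff);
    i.pte = (unsigned)((va >> 12) & 0x1ff);
    i.offset = (unsigned)(va & 0xfff);
    return i;
}
```

For split_vaddr(0xffffea0003800028ULL) this gives PGD index 468, PUD index 0, PMD index 28, PTE index 0 and offset 0x28. The oops prints non-zero PGD and PUD entries followed by "PMD 0", so the walk appears to stop at the PMD level, which would be consistent with the 2 MiB slice of the struct page array that covers pfn 0x100000 not being mapped.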
Konrad Rzeszutek Wilk
2011-Apr-28 18:30 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Wed, Apr 27, 2011 at 07:45:42PM -0400, Scott Garron wrote:
> On 04/27/2011 04:09 PM, Konrad Rzeszutek Wilk wrote:
> >Duh! I meant this one:
> >
> >[    0.316665] RIP: e030:[<ffffffff819a8aea>] [<ffffffff819a8aea>]
> >balloon_init+0x20b/0x25e
> >
> >Sorry about that. Can you also run your kernel with 'initcall_debug
> >loglevel=8' please?
>
> Ok, I've put what I came up with here:
>
> http://www.pridelands.org/~simba/xen-debug/debugnotes.txt
>
> I also added a few pr_info() lines around the offending code to try
> to get more of a handle of how far it is getting and what it's working
> on at the time of failure:

This looks quite odd. We had a flurry of issues like these before, where
we "forgot" to set the P2M table correctly, so that during

[    0.000000] init_memory_mapping: 0000000100000000-00000001d9ff0000

it would crash b/c for PFNs above the 'dom0_mem' parameter we would
return an INVALID value and the machine would crash - but only if the
value was not aligned (git commit f06e457cb729d58430d1385014fab367b2d4e7c2).

But that isn't the case here (dom0_mem=512M). And you say it boots fine
under DomU - so there is some P2M, E820 funkiness happening here I think.
Had you tried booting the kernel as Dom0 with different sizes of dom0_mem
("dom0_mem=max:2GB"?) Or without the dom0_mem parameter at all?

What is your CONFIG_XEN_MAX_DOMAIN_MEMORY set to?

>
> ********
>
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index a065fda..b5f0650 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -488,10 +488,13 @@ static int __init balloon_init(void)
>          */
>         extra_pfn_end = min(min(max_pfn, e820_end_of_ram_pfn()),
>                             (unsigned long)PFN_DOWN(xen_extra_mem_start
> + xen_ex
> +       pr_info("extra_pfn_end: 0x%x", extra_pfn_end);  /* debug */
>         for (pfn = PFN_UP(xen_extra_mem_start);
>              pfn < extra_pfn_end;
>              pfn += balloon_npages) {
> +               pr_info("pfn: 0x%x", pfn);  /* debug */
>                 page = pfn_to_page(pfn);
> +               pr_info("page: 0x%p", page);  /* debug */
>                 /* totalram_pages doesn't include the boot-time
>                    balloon extension, so don't subtract from it. */
>                 __balloon_append(page);
>
>
> ********
>
> The new serial console output, with "initcall_debug loglevel=8" and
> the pr_info() additions to the code can be found here:
>
> http://www.pridelands.org/~simba/xen-debug/hailstorm-fullserial20110427.txt
>
> ... but I'll paste the part closest to the crash here for your convenience:
>
> [    1.016663] calling  balloon_init+0x0/0x280 @ 1
> [    1.016663] xen_balloon: Initialising balloon driver with page order 0.
> [    1.033446] last_pfn = 0x1d9ff0 max_arch_pfn = 0x400000000
> [    1.036663] extra_pfn_end: 0x1d9ff0
> [    1.036663] pfn: 0x100000
> [    1.036663] page: 0xffffea0003800000
> [    1.036663] BUG: unable to handle kernel paging request at
> ffffea0003800028
> [    1.036663] IP: [<ffffffff819a8b1f>] balloon_init+0x240/0x280
> [    1.036663] PGD 18402067 PUD 18403067 PMD 0
>
>
> So the crash is happening within the first iteration of that for()
> loop, presumably while calling __balloon_append(page). That's as far as
> I dove into it so far, but I figured I'd give you an update as to what
> I've found and tried.
>
> Just for more information sake, I also tried booting this kernel as
> a paravirt domU under the Debian Stable 2.6.32-5-xen-amd64 stock kernel
> and Xen 4.1.0. It booted without incident (aside from a ridiculously
> long spew of printk's from my additions to that for() loop), so the
> failure is specific to the kernel booting as a dom0. That probably
> doesn't narrow down much, but I figured it was noteworthy.
>
> --
> Scott Garron
Scott Garron
2011-Apr-29 00:15 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 04/28/2011 02:30 PM, Konrad Rzeszutek Wilk wrote:
> And you say it boots fine under DomU - so there is some P2M, E820
> funkiness happening here I think. Had you tried booting the kernel
> as Dom0 with different sizes of dom0_mem ("dom0_mem=max:2GB?") Or
> without the dom0_mem parameter at all?

From my first message to the list in this thread: "I've tried
booting with no dom0_mem option just to see if that made any
difference." The same crash happened with or without the dom0_mem
parameter.

After trying "dom0_mem=max:2048MB" just now, I get the same crash.

> What is your CONFIG_XEN_MAX_DOMAIN_MEMORY set to?

It's currently set to 128. The kernel config I'm using is at
http://www.pridelands.org/~simba/xen-debug/hailstorm-kernelconfig.txt

Here's what has me scratching my head, though... This bytecode
instruction:

0xffffffff819a8aca <balloon_init+491>:  imul   $0x38,%rdx,%rcx

If I use gdb to point me to the C code for that instruction, it gives me:

page = pfn_to_page(pfn);

... from within the for() loop in question.

Expanding the macro "pfn_to_page(pfn)", I get:

(((struct page *)(0xffffea0000000000UL)) + (pfn))

So, the preprocessed C code should look like:

page = (((struct page *)(0xffffea0000000000UL)) + (pfn));

Why would an addition operation in C translate to a multiplication
instruction in the bytecode? Moreover, where does multiplying by 0x38
come from? It seems to me that if pfn is 0x100000, and it gets added to
0xffffea0000000000, the end result would be 0xffffea0000100000. The
pr_info() that I added shows that "page" is equal to 0xffffea0003800000,
indicating that it multiplied pfn by 0x38 before adding it to
0xffffea0000000000 instead of simply adding it.

Because the kernel configuration option "CONFIG_SPARSEMEM_VMEMMAP"
mentioned pfn_to_page(pfn), one of the things I tried was unsetting it
but ended up with the same results (except that "page" was
0x0000000003800000).

Then, I suspected a compiler problem, so I tried recompiling with
gcc 4.3 instead of 4.5. Same results.

Just for kicks, I tried hexediting balloon.o and changing that
instruction to "imul $0x1,%rdx,%rcx" (since multiplying by 1 will
essentially nullify the instruction), but the end result was still the
same crash, even though the value for "page" ended up being
0x0000000000100000.

It would seem that my suspicion on that instruction was incorrect,
but I'm still having trouble wrapping my mind around why 0x38 is even
there.

-- 
Scott Garron
Scott Garron
2011-Apr-29 02:12 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
Scott Garron wrote:
> Just for kicks, I tried hexediting balloon.o and changing that
> instruction to "imul $0x1,%rdx,%rcx" (since multiplying by 1 will
> essentially nullify the instruction), but the end result was still
> the same crash, even though the value for "page" ended up being
> 0x0000000000100000.

While still thinking along those lines, I re-enabled
CONFIG_SPARSEMEM_VMEMMAP, then hexedited the instruction again, and the
kernel got further along in the boot process, but crashed while trying
to free the initrd memory. The serial console from that boot is at:

http://www.pridelands.org/~simba/xen-debug/hailstorm-fullserial20110428-afterhexedit.txt

My deduction so far is that "page = pfn_to_page(pfn);" is somehow
returning a location that isn't quite "correct", but removing the
"multiply by 0x38" instruction only returned something partially
usable and it took a dump all over the memory pages.

Admittedly, I really know little about how all of this works, so my
debugging process is like taking stabs in the dark. It's somewhat
intriguing to me, so I'm pretty much just playing with it until someone
who knows more can reproduce it. It's hard to imagine that I'm the only
one having this problem with the current "xen/stable-2.6.32.x" branch.

-- 
Scott Garron
Dan Magenheimer
2011-Apr-29 14:43 UTC
RE: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
> Scott Garron wrote:
> > Just for kicks, I tried hexediting balloon.o and changing that
> > instruction to "imul $0x1,%rdx,%rcx" (since multiplying by 1 will
> > essentially nullify the instruction), but the end result was still
> > the same crash, even though the value for "page" ended up being
> > 0x0000000000100000.

That multiply is correct. In C, when you add an integer X to a
pointer to a struct of size N, the result is the same as if you
were accessing the Xth element of an array of those structs.

struct foo *pfoo;
int X;
size_t N;

N = sizeof(struct foo);
(unsigned long)(pfoo + X) == (unsigned long)pfoo + (N * X)  /* is always true */

> My deduction so far is that "page = pfn_to_page(pfn);" is somehow
> returning a location that isn't quite "correct", but removing the
> "multiply by 0x38" instruction only returned something partially
> usable and it took a dump all over the memory pages.
>
> Admittedly, I really know little about how all of this works, so my
> debugging process is like taking stabs in the dark. It's somewhat
> intriguing to me, so I'm pretty much just playing with it until someone
> who knows more can reproduce it. It's hard to imagine that I'm the only
> one having this problem with the current "xen/stable-2.6.32.x" branch.

A couple thoughts:

1) Is your guest an HVM or PV? IIRC, earlier versions of the
   balloon driver did not run properly in an HVM guest.
   Compare your source with a latest upstream balloon_init.
2) Are you building xen/stable-2.6.32.x as the kernel in a guest?
   Any chance you might be loading a balloon module that doesn't
   match the kernel you built?
3) I think developers generally use the xen/stable-2.6.32.x for
   dom0 and use distro kernels (or newer upstream kernels) for
   guest kernels. So it is very possible that you are the
   only one having this problem because you are the only one
   using a balloon driver on a xen/stable-2.6.32.x kernel
   in a non-dom0 (HVM?) guest.
4) The latest upstream balloon driver does some magic with the
   E820 memory map. Perhaps your machine has an odd or incorrect
   E820 map from the BIOS? (This is outside of my area of expertise
   so apologies if this doesn't make sense.)

Hope that helps!
Dan
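Dan's point about pointer arithmetic can be checked against the numbers in Scott's log. What follows is a minimal userspace sketch: struct fake_page is a hypothetical 0x38-byte stand-in (the size is inferred from the "imul $0x38" instruction, not taken from the kernel headers), and FAKE_VMEMMAP_BASE is the base address from the expanded macro quoted earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page: 0x38 (56) bytes, contents irrelevant. */
struct fake_page {
    unsigned char bytes[0x38];
};

/* Base address taken from the expanded pfn_to_page() macro in the thread. */
#define FAKE_VMEMMAP_BASE 0xffffea0000000000ULL

/*
 * Integer model of "(struct page *)base + pfn": the compiler scales the
 * index by the element size, which is where the multiply comes from.
 */
static uint64_t fake_pfn_to_page_addr(uint64_t pfn)
{
    return FAKE_VMEMMAP_BASE + pfn * sizeof(struct fake_page);
}

/* Same rule on a real array: ptr + n lands n * sizeof(*ptr) bytes further on. */
static uint64_t byte_distance(unsigned n)
{
    static struct fake_page pages[4];

    assert(n <= 4); /* stay within, or one past, the array */
    return (uint64_t)((uintptr_t)(pages + n) - (uintptr_t)pages);
}
```

With pfn = 0x100000, fake_pfn_to_page_addr() returns 0xffffea0003800000, the value Scott's pr_info() printed for "page", so the computed pointer is exactly what the C semantics call for.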
Scott Garron
2011-Apr-29 16:56 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 4/29/2011 10:43 AM, Dan Magenheimer wrote:
> That multiply is correct. In C, when you add an integer X to a
> pointer to a struct of size N, the result is the same as if you were
> accessing the Xth element of an array of those structs.

Oh yeah! I completely forgot about that. (Haven't done any programming
for about six years, so ... rusty)

> 1) Is your guest an HVM or PV? IIRC, earlier versions of the
> balloon driver did not run properly in an HVM guest. Compare your
> source with a latest upstream balloon_init.

This machine is too old to have the ability to run a guest as HVM
(it's pre-Pacifica/AMD-V), but the problem isn't with running a domU.
It won't boot into dom0 (privileged domain) due to this bug. Oddly
enough, the build in question *does* work as a PV domU when the hardware
is booted to the Debian stock linux-image-xen-amd64 kernel
(2.6.32-5-xen-amd64).

> 2) Are you building xen/stable-2.6.32.x as the kernel in a guest?
> Any chance you might be loading a balloon module that doesn't match
> the kernel you built?

I'm building the kernel on a separate, faster machine than the one
I'm trying to boot it on. I wasn't aware that where it was being built
made much of a difference as long as the configured processor type was
correct. Both machines are x86_64, but one is Xeon and the other is
Opteron. If it does matter where it's built, how do distributions put
out pre-compiled dom0 kernels? Also, if it does matter, I can try
building it on the slower machine. (The faster one is a production
server, so I'm not testing all of these reboots on that)

> 3) I think developers generally use the xen/stable-2.6.32.x for dom0
> and use distro kernels (or newer upstream kernels) for guest kernels.
> So it is very possible that you are the only one having this problem
> because you are the only one using a balloon driver on a
> xen/stable-2.6.32.x kernel in a non-dom0 (HVM?) guest.

I *am* trying to use it in a dom0 guest. It's:

Hardware/BIOS -> Xen4.1 -> this xen/stable-2.6.32.x dom0 -> crash

> 4) The latest upstream balloon driver does some magic with the E820
> memory map. Perhaps your machine has an odd or incorrect E820 map
> from the BIOS? (This is outside of my area of expertise so apologies
> if this doesn't make sense.)

I suppose it could be odd. The machine is about 8 years old. I'd
imagine that they were doing things a bit differently back then.

-- 
Scott Garron
Dan Magenheimer
2011-Apr-29 19:38 UTC
RE: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
(Jeremy cc'ed, see below)

> > 4) The latest upstream balloon driver does some magic with the E820
> > memory map. Perhaps your machine has an odd or incorrect E820 map
> > from the BIOS? (This is outside of my area of expertise so apologies
> > if this doesn't make sense.)
>
> I suppose it could be odd. The machine is about 8 years old. I'd
> imagine that they were doing things a bit differently back then.

Hi Scott --

Since you are using dom0_mem, there really should be no reason
why the balloon driver needs to get initialized. For the
balloon driver to work properly, it needs a correct E820
map and there have been recent upstream changes in the
balloon driver involving E820. If we assume that your
E820 map is indeed broken, the easiest fix for your machine
might be just to modify balloon_init in dom0 to immediately
fail (and return -ENODEV).

Jeremy, I wonder if Scott is experiencing a side-effect of
this change?

http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00375.html
Scott Garron
2011-Apr-29 23:08 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 04/29/2011 03:38 PM, Dan Magenheimer wrote:
> Since you are using dom0_mem, there really should be no reason why
> the balloon driver needs to get initialized. For the balloon driver
> to work properly, it needs a correct E820 map and there have been
> recent upstream changes in the balloon driver involving E820. If we
> assume that your E820 map is indeed broken, the easiest fix for your
> machine might be just to modify balloon_init in dom0 to immediately
> fail (and return -ENODEV).

Returning -ENODEV at the beginning of balloon_init() did allow the
machine to boot, but something is definitely still amiss:

Trying to start any domUs yields:

simba@hailstorm:~$ sudo xm create test.cfg
Using config file "/etc/xen/test.cfg".
Error: Failed to query current memory allocation of dom0.

Oddly enough, xencommons and xend started and xm list and xm info
show everything correctly.

And this is just bizarre...

simba@hailstorm:~$ free
             total       used       free     shared    buffers     cached
Mem:           552 18014398509158460     324076          0      11428      34656
-/+ buffers/cache: 18014398509112376     370160
Swap:            0          0          0

I just tried grabbing the most recent BIOS revision from Tyan for
this motherboard (was running 2882_306, flashed it to 2882_309). Same
results.

-- 
Scott Garron
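The huge "used" figure in the free output above is what an unsigned subtraction produces when "free" is larger than "total". What follows is a minimal sketch of one plausible explanation, assuming the tool computes used = total - free in 64-bit unsigned kilobytes and then scales the value through bytes (shift left by 10, then right by 10). That is an assumption about the internals of free, not something confirmed in this thread, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of "used" when free_kb > total_kb: the subtraction wraps modulo
 * 2^64, the shift to bytes drops the top ten bits, and the shift back to
 * kilobytes leaves a value just under 2^54.
 */
static uint64_t wrapped_used_kb(uint64_t total_kb, uint64_t free_kb)
{
    uint64_t used_kb = total_kb - free_kb; /* wraps when free > total */
    uint64_t used_bytes = used_kb << 10;   /* kB -> bytes, high bits lost */

    return used_bytes >> 10;               /* bytes -> kB */
}

/* Self-check against the numbers in the free output quoted above. */
static void check_reported_values(void)
{
    uint64_t used = wrapped_used_kb(552, 324076);

    assert(used == 18014398509158460ULL);
    /* The "-/+ buffers/cache" row subtracts buffers and cached from used. */
    assert(used - 11428 - 34656 == 18014398509112376ULL);
}
```

Both figures in the output are reproduced by this model (2^54 is 18014398509481984, and 552 - 324076 is -323524), which suggests the kernel is reporting a total that is far smaller than the amount of free memory rather than free itself misbehaving.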
Konrad Rzeszutek Wilk
2011-May-04 15:58 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Fri, Apr 29, 2011 at 07:08:29PM -0400, Scott Garron wrote:
> On 04/29/2011 03:38 PM, Dan Magenheimer wrote:
> >Since you are using dom0_mem, there really should be no reason why
> >the balloon driver needs to get initialized. For the balloon driver
> >to work properly, it needs a correct E820 map and there have been
> >recent upstream changes in the balloon driver involving E820. If we
> >assume that your E820 map is indeed broken, the easiest fix for your
> >machine might be just to modify balloon_init in dom0 to immediately
> >fail (and return -ENODEV).
>
> Returning -ENODEV at the beginning of balloon_init() did allow the
> machine to boot, but something is definitely still amiss:
>
> Trying to start any domUs yields:
>
> simba@hailstorm:~$ sudo xm create test.cfg
> Using config file "/etc/xen/test.cfg".
> Error: Failed to query current memory allocation of dom0.
>
> Oddly enough, xencommons and xend started and xm list and xm info
> show everything correctly.

That is probably b/c 'xm' tries to touch /sys/.../target_kb and since the
balloon code never got to run, that sysfs entry does not exist. So 'xm'
bails out.

>
> And this is just bizarre...
>
> simba@hailstorm:~$ free
>              total       used       free     shared    buffers     cached
> Mem:           552 18014398509158460     324076          0      11428
> 34656
> -/+ buffers/cache: 18014398509112376     370160
> Swap:            0          0          0
>
>
> I just tried grabbing the most recent BIOS revision from Tyan for
> this motherboard (was running 2882_306, flashed it to 2882_309). Same
> results.

This is quite baffling. The failure you get:

[    0.490020] BUG: unable to handle kernel paging request at 0000000003800028
[    0.493331] IP: [<ffffffff812d89fa>] __balloon_append+0x3f/0x52
[    0.493331] PGD 0

tells us that the entry in the PGD, so in init_level4_pgt[3], is zero
instead of having a PFN/MFN in it.

But earlier on:

[    0.000000] init_memory_mapping: 0000000100000000-00000001d9ff0000
[    0.000000]  0100000000 - 01d9ff0000 page 4k
[    0.000000] kernel direct mapping tables up to 1d9ff0000 @ dafa000-e9d3000

tells us that the pages have indeed been filled up with the right values.
But perhaps the values for the entries past the 4G are filled with zero, which
might be the case as in (pte_pfn_to_mfn):

        /*
         * If there's no mfn for the pfn, then just create an
         * empty non-present pte.  Unfortunately this loses
         * information about the original pfn, so
         * pte_mfn_to_pfn is asymmetric.
         */
        if (unlikely(mfn == INVALID_P2M_ENTRY)) {
                mfn = 0;
                flags = 0;
        }

Which is correct... we should fill those entries with zero as they don't exist.

Hmm..

But in the meantime, try the attached patch (not compile tested). It should
give us an idea what is happening with the allocation. The theory is that
'kernel_physical_mapping_init' in init_64.c, when it calls:

        pgd_populate(&init_mm, pgd, __va(pud_phys));

and then:

#if PAGETABLE_LEVELS > 3
static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
{
        paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
-->     set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)));   <<---
}

ends up doing a set_pgd(pgd, 0) instead of the proper pointer value being
programmed.. which would imply that the location (so __pa(pud)) is zero
instead of a physical location.

Anyhow, try this patch below. Should give us some ideas. Not compile tested.

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 204e3ba..7a8177c 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -806,6 +806,8 @@ static pteval_t pte_pfn_to_mfn(pteval_t val)
                if (unlikely(mfn == INVALID_P2M_ENTRY)) {
                        mfn = 0;
                        flags = 0;
+                       printk(KERN_INFO "%s: 0x%lx is INVALID for %llx\n",
+                               ___func__, pfn, (unsigned long)val);
                }

                val = ((pteval_t)mfn << PAGE_SHIFT) | flags;
@@ -935,6 +937,7 @@ void xen_set_pud_hyper(pud_t *ptr, pud_t val)
        /* ptr may be ioremapped for 64-bit pagetable setup */
        u.ptr = arbitrary_virt_to_machine(ptr).maddr;
        u.val = pud_val_ma(val);
+       printk(KERN_INFO "%s: %lx, %lx\n", __func__, u.ptr, u.val);
        xen_extend_mmu_update(&u);

        ADD_STATS(pud_update_batched, paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU);
@@ -1889,6 +1892,7 @@ static __init void xen_alloc_pmd_init(struct mm_struct *mm, unsigned long pfn)
 #ifdef CONFIG_FLATMEM
        BUG_ON(mem_map);        /* should only be used early */
 #endif
+       printk(KERN_INFO "%s: %lx set for PGD/PMD\n", __func__, pfn);
        make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
 }
Scott Garron
2011-May-04 19:19 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 05/04/2011 11:58 AM, Konrad Rzeszutek Wilk wrote:
> It tells us that the pages have been indeed filled up with the right values.
> But perhaps the values for the entries past the 4G are filled with zero, which
> might be the case as in (pte_pfn_to_mfn):

Just FYI, the machine has exactly 4G of RAM - 2 on one CPU, 2 on
the other (two socket motherboard, single core each).

> Anyhow, try this patch below. Should give us some ideas. Not compile tested.

> + printk(KERN_INFO "%s: 0x%lx is INVALID for %llx\n",
> +         ___func__, pfn, (unsigned long)val);

The compile complained about __func__ not being declared and
exited with an error.

> + printk(KERN_INFO "%s: %lx, %lx\n", __func__, u.ptr, u.val);

The compile also warned about %lx expecting 'long unsigned int',
but argument 3 and 4 are 'uint64_t'.

arch/x86/xen/mmu.c: In function 'pte_pfn_to_mfn':
arch/x86/xen/mmu.c:810:5: error: '___func__' undeclared (first use in this function)
arch/x86/xen/mmu.c:810:5: note: each undeclared identifier is reported only once for each function it appears in
arch/x86/xen/mmu.c: In function 'xen_set_pud_hyper':
arch/x86/xen/mmu.c:940:2: warning: format '%lx' expects type 'long unsigned int', but argument 3 has type 'uint64_t'
arch/x86/xen/mmu.c:940:2: warning: format '%lx' expects type 'long unsigned int', but argument 4 has type 'uint64_t'
make[2]: *** [arch/x86/xen/mmu.o] Error 1
make[1]: *** [arch/x86/xen] Error 2
make: *** [arch/x86] Error 2
make: *** Waiting for unfinished jobs....

-- 
Scott Garron
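The two diagnostics above have separate causes: the error comes from "___func__" being spelled with three leading underscores (the predefined identifier is __func__), and the warnings come from passing uint64_t values to a %lx conversion. What follows is a minimal userspace sketch of one portable way to format such values, assuming a cast to unsigned long long paired with %llx. It uses snprintf() rather than printk(), and format_update is a hypothetical name, not a function from mmu.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format two 64-bit values in hex, prefixed with the enclosing function's
 * name. The casts make the arguments match %llx regardless of how the
 * platform defines uint64_t.
 */
static int format_update(char *buf, size_t len, uint64_t ptr, uint64_t val)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s: %llx, %llx", __func__,
                    (unsigned long long)ptr, (unsigned long long)val);
}
```

Called with ptr = 0x1d9ff0 and val = 0xc403067 (values borrowed from the logs purely as sample input), the buffer ends up holding "format_update: 1d9ff0, c403067".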
Konrad Rzeszutek Wilk
2011-May-04 19:35 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Wed, May 04, 2011 at 03:19:08PM -0400, Scott Garron wrote:
> On 05/04/2011 11:58 AM, Konrad Rzeszutek Wilk wrote:
> >It tells us that the pages have been indeed filled up with the right values.
> >But perhaps the values for the entries past the 4G are filled with zero, which
> >might be the case as in (pte_pfn_to_mfn):
>
> Just FYI, the machine has exactly 4G of RAM - 2 on one CPU, 2 on
> the other (two socket motherboard, single core each).
>
> >Anyhow, try this patch below. Should give us some ideas. Not compile tested.
                                                            ^^^^^^^^^^
The __func__ can be just __FUNCTION__ and the rest can be ignored (I think)

> >+ printk(KERN_INFO "%s: 0x%lx is INVALID for %llx\n",
> >+ ___func__, pfn, (unsigned long)val);
>
> The compiler complained about __func__ not being declared and
> exited with an error.
>
> >+ printk(KERN_INFO "%s: %lx, %lx\n", __func__, u.ptr, u.val);
>
> The compiler also warned about %lx expecting 'long unsigned int',
> but arguments 3 and 4 are 'uint64_t'.
>
> arch/x86/xen/mmu.c: In function ‘pte_pfn_to_mfn’:
> arch/x86/xen/mmu.c:810:5: error: ‘___func__’ undeclared (first use in this function)
> arch/x86/xen/mmu.c:810:5: note: each undeclared identifier is reported only once for each function it appears in
> arch/x86/xen/mmu.c: In function ‘xen_set_pud_hyper’:
> arch/x86/xen/mmu.c:940:2: warning: format ‘%lx’ expects type ‘long unsigned int’, but argument 3 has type ‘uint64_t’
> arch/x86/xen/mmu.c:940:2: warning: format ‘%lx’ expects type ‘long unsigned int’, but argument 4 has type ‘uint64_t’
> make[2]: *** [arch/x86/xen/mmu.o] Error 1
> make[1]: *** [arch/x86/xen] Error 2
> make: *** [arch/x86] Error 2
> make: *** Waiting for unfinished jobs....
>
> --
> Scott Garron
Scott Garron
2011-May-04 20:17 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 05/04/2011 03:35 PM, Konrad Rzeszutek Wilk wrote:
>>> [..] Should give us some ideas. Not compile tested.
> ................................. ^^^^^^^^^^

I was tracking down why it wouldn't compile, but fired off an e-mail because I figured that if at least 2 of us were working on it, we'd figure it out faster.

> The __func__ can be just __FUNCTION__ and the rest can be ignored (I
> think)

It turned out that there were three underscores in front of "func" instead of two in that one line. I removed one of the underscores, got it to compile, and the resulting messages at boot time are a seemingly neverending spew of:

[ 0.000000] pte_pfn_to_mfn: 0x110e01 is INVALID for 8000000110e01063

... where 0x110e01 and 8000000110e01063 are ever-increasing values. The 8000000110e01063 number always ends in "63", though.

Here's the serial console output from the time Xen starts the Linux kernel until the spew of repetitive messages:

[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 2.6.32.39 (root@hailstorm) (gcc version 4.5.2 (Debian 4.5.2-8) ) #26 SMP Wed May 4 15:47:34 EDT 2011
[ 0.000000] Command line: placeholder root=UUID=be3507b9-015f-4ac6-9b4a-914a9c774421 ro console=hvc0 earlyprintk=xen nomodeset initcall_debug loglevel=8
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Centaur CentaurHauls
[ 0.000000] released 0 pages of unused memory
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 000000000009f000 (usable)
[ 0.000000] Xen: 000000000009f400 - 0000000000100000 (reserved)
[ 0.000000] Xen: 0000000000100000 - 0000000020000000 (usable)
[ 0.000000] Xen: 0000000020000000 - 00000000f9ff0000 (unusable)
[ 0.000000] Xen: 00000000f9ff0000 - 00000000f9fff000 (ACPI data)
[ 0.000000] Xen: 00000000f9fff000 - 00000000fa000000 (ACPI NVS)
[ 0.000000] Xen: 00000000febfe000 - 00000000fec01000 (reserved)
[ 0.000000] Xen: 00000000fee00000 - 00000000fee01000 (reserved)
[ 0.000000] Xen: 00000000ff780000 - 0000000100000000 (reserved)
[ 0.000000] Xen: 0000000100000000 - 00000001d9ff0000 (usable)
[ 0.000000] bootconsole [xenboot0] enabled
[ 0.000000] DMI 2.3 present.
[ 0.000000] AMI BIOS detected: BIOS may corrupt low RAM, working around it.
[ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
[ 0.000000] last_pfn = 0x1d9ff0 max_arch_pfn = 0x400000000
[ 0.000000] x86 PAT enabled: cpu 0, old 0x50100070406, new 0x7010600070106
[ 0.000000] last_pfn = 0x20000 max_arch_pfn = 0x400000000
[ 0.000000] Scanning 0 areas for low memory corruption
[ 0.000000] modified physical RAM map:
[ 0.000000] modified: 0000000000000000 - 0000000000010000 (reserved)
[ 0.000000] modified: 0000000000010000 - 000000000009f000 (usable)
[ 0.000000] modified: 000000000009f400 - 0000000000100000 (reserved)
[ 0.000000] modified: 0000000000100000 - 0000000020000000 (usable)
[ 0.000000] modified: 0000000020000000 - 00000000f9ff0000 (unusable)
[ 0.000000] modified: 00000000f9ff0000 - 00000000f9fff000 (ACPI data)
[ 0.000000] modified: 00000000f9fff000 - 00000000fa000000 (ACPI NVS)
[ 0.000000] modified: 00000000febfe000 - 00000000fec01000 (reserved)
[ 0.000000] modified: 00000000fee00000 - 00000000fee01000 (reserved)
[ 0.000000] modified: 00000000ff780000 - 0000000100000000 (reserved)
[ 0.000000] modified: 0000000100000000 - 00000001d9ff0000 (usable)
[ 0.000000] initial memory mapped : 0 - 0d919000
[ 0.000000] init_memory_mapping: 0000000000000000-0000000020000000
[ 0.000000] 0000000000 - 0020000000 page 4k
[ 0.000000] kernel direct mapping tables up to 20000000 @ 100000-202000
[ 0.000000] init_memory_mapping: 0000000100000000-00000001d9ff0000
[ 0.000000] 0100000000 - 01d9ff0000 page 4k
[ 0.000000] kernel direct mapping tables up to 1d9ff0000 @ da8b000-e964000
[ 0.000000] pte_pfn_to_mfn: 0x100000 is INVALID for 8000000100000063
[ 0.000000] pte_pfn_to_mfn: 0x100001 is INVALID for 8000000100001063
[ 0.000000] pte_pfn_to_mfn: 0x100002 is INVALID for 8000000100002063
[ 0.000000] pte_pfn_to_mfn: 0x100003 is INVALID for 8000000100003063

[ .. and so on - if it ever ends, I'll send you the tail of it .. ]

-- 
Scott Garron
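The observation that the second number always ends in "63" fits the standard x86-64 page-table entry layout: the low 12 bits are flags (0x63 is present, writable, accessed and dirty), bit 63 is the no-execute bit, and bits 12 to 51 carry the frame number, which would explain why the two printed values increase together. A minimal sketch of that decoding, assuming the architectural bit positions; the macro and function names are hypothetical and are not the kernel's own:

```c
#include <stdint.h>

/* x86-64 page-table entry bits (architectural positions). */
#define PTE_PRESENT   (1ULL << 0)
#define PTE_RW        (1ULL << 1)
#define PTE_ACCESSED  (1ULL << 5)
#define PTE_DIRTY     (1ULL << 6)
#define PTE_NX        (1ULL << 63)
#define PTE_ADDR_MASK 0x000ffffffffff000ULL /* bits 12..51: frame number */
#define PAGE_SHIFT_4K 12

/* Hypothetical helper: extract the page frame number from a PTE value. */
static uint64_t pte_to_pfn(uint64_t pte)
{
	return (pte & PTE_ADDR_MASK) >> PAGE_SHIFT_4K;
}

/* Hypothetical helper: extract the low 12 flag bits from a PTE value. */
static uint64_t pte_low_flags(uint64_t pte)
{
	return pte & 0xfffULL;
}
```

Applied to 8000000110e01063 from the message above, pte_to_pfn() yields 0x110e01 and pte_low_flags() yields 0x63, matching the pair printed on each log line.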
Konrad Rzeszutek Wilk
2011-May-04 20:23 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Wed, May 04, 2011 at 04:17:51PM -0400, Scott Garron wrote:
> On 05/04/2011 03:35 PM, Konrad Rzeszutek Wilk wrote:
> >>>[..] Should give us some ideas. Not compile tested.
> >................................. ^^^^^^^^^^
>
> I was tracking down why it wouldn't compile, but fired off an
> e-mail because I figured that if at least 2 of us were working on it,
> we'd figure it out faster.
>
> >The __func__ can be just __FUNCTION__ and the rest can be ignored (I
> >think)
>
> It turned out that there were three underscores in front of "func"
> instead of two in that one line. I removed one of the underscores, got
> it to compile, and the resulting messages at boot time are a seemingly
> neverending spew of:
>
> [ 0.000000] pte_pfn_to_mfn: 0x110e01 is INVALID for 8000000110e01063
>
> ... where 0x110e01 and 8000000110e01063 are ever-increasing values.
> The 8000000110e01063 number always ends in "63", though.

Ok, maybe take out the debug for the pte_pfn_to_mfn. And see what the pgd debug gives us.

>
> Here's the serial console output from the time Xen starts the Linux
> kernel until the spew of repetitive messages:
>
> [ 0.000000] Initializing cgroup subsys cpuset
> [ 0.000000] Initializing cgroup subsys cpu
> [ 0.000000] Linux version 2.6.32.39 (root@hailstorm) (gcc version 4.5.2 (Debian 4.5.2-8) ) #26 SMP Wed May 4 15:47:34 EDT 2011
> [ 0.000000] Command line: placeholder root=UUID=be3507b9-015f-4ac6-9b4a-914a9c774421 ro console=hvc0 earlyprintk=xen nomodeset initcall_debug loglevel=8
> [ 0.000000] KERNEL supported cpus:
> [ 0.000000] Intel GenuineIntel
> [ 0.000000] AMD AuthenticAMD
> [ 0.000000] Centaur CentaurHauls
> [ 0.000000] released 0 pages of unused memory
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] Xen: 0000000000000000 - 000000000009f000 (usable)
> [ 0.000000] Xen: 000000000009f400 - 0000000000100000 (reserved)
> [ 0.000000] Xen: 0000000000100000 - 0000000020000000 (usable)
> [ 0.000000] Xen: 0000000020000000 - 00000000f9ff0000 (unusable)
> [ 0.000000] Xen: 00000000f9ff0000 - 00000000f9fff000 (ACPI data)
> [ 0.000000] Xen: 00000000f9fff000 - 00000000fa000000 (ACPI NVS)
> [ 0.000000] Xen: 00000000febfe000 - 00000000fec01000 (reserved)
> [ 0.000000] Xen: 00000000fee00000 - 00000000fee01000 (reserved)
> [ 0.000000] Xen: 00000000ff780000 - 0000000100000000 (reserved)
> [ 0.000000] Xen: 0000000100000000 - 00000001d9ff0000 (usable)
> [ 0.000000] bootconsole [xenboot0] enabled
> [ 0.000000] DMI 2.3 present.
> [ 0.000000] AMI BIOS detected: BIOS may corrupt low RAM, working around it.
> [ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
> [ 0.000000] last_pfn = 0x1d9ff0 max_arch_pfn = 0x400000000
> [ 0.000000] x86 PAT enabled: cpu 0, old 0x50100070406, new 0x7010600070106
> [ 0.000000] last_pfn = 0x20000 max_arch_pfn = 0x400000000
> [ 0.000000] Scanning 0 areas for low memory corruption
> [ 0.000000] modified physical RAM map:
> [ 0.000000] modified: 0000000000000000 - 0000000000010000 (reserved)
> [ 0.000000] modified: 0000000000010000 - 000000000009f000 (usable)
> [ 0.000000] modified: 000000000009f400 - 0000000000100000 (reserved)
> [ 0.000000] modified: 0000000000100000 - 0000000020000000 (usable)
> [ 0.000000] modified: 0000000020000000 - 00000000f9ff0000 (unusable)
> [ 0.000000] modified: 00000000f9ff0000 - 00000000f9fff000 (ACPI data)
> [ 0.000000] modified: 00000000f9fff000 - 00000000fa000000 (ACPI NVS)
> [ 0.000000] modified: 00000000febfe000 - 00000000fec01000 (reserved)
> [ 0.000000] modified: 00000000fee00000 - 00000000fee01000 (reserved)
> [ 0.000000] modified: 00000000ff780000 - 0000000100000000 (reserved)
> [ 0.000000] modified: 0000000100000000 - 00000001d9ff0000 (usable)
> [ 0.000000] initial memory mapped : 0 - 0d919000
> [ 0.000000] init_memory_mapping: 0000000000000000-0000000020000000
> [ 0.000000] 0000000000 - 0020000000 page 4k
> [ 0.000000] kernel direct mapping tables up to 20000000 @ 100000-202000
> [ 0.000000] init_memory_mapping: 0000000100000000-00000001d9ff0000
> [ 0.000000] 0100000000 - 01d9ff0000 page 4k
> [ 0.000000] kernel direct mapping tables up to 1d9ff0000 @ da8b000-e964000
> [ 0.000000] pte_pfn_to_mfn: 0x100000 is INVALID for 8000000100000063
> [ 0.000000] pte_pfn_to_mfn: 0x100001 is INVALID for 8000000100001063
> [ 0.000000] pte_pfn_to_mfn: 0x100002 is INVALID for 8000000100002063
> [ 0.000000] pte_pfn_to_mfn: 0x100003 is INVALID for 8000000100003063
>
> [ .. and so on - if it ever ends, I'll send you the tail of it .. ]
>
> --
> Scott Garron
Scott Garron
2011-May-04 21:55 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 05/04/2011 04:23 PM, Konrad Rzeszutek Wilk wrote:
> Ok, maybe take out the debug for the pte_pfn_to_mfn. And see what
> the pgd debug gives us.

I ended up just letting it complete on its own, but will remove that debug printk() for future tests. Keep in mind that I still have "return -ENODEV;" at the beginning of balloon_init() for the kernel used during this boot. The full serial console output, minus about 61 megabytes of the pte_pfn_to_mfn messages trimmed from it, is at:

http://www.pridelands.org/~simba/xen-debug/hailstorm-fullserial20110504.txt

-- 
Scott Garron
Konrad Rzeszutek Wilk
2011-May-04 22:16 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Wed, May 04, 2011 at 05:55:02PM -0400, Scott Garron wrote:
> On 05/04/2011 04:23 PM, Konrad Rzeszutek Wilk wrote:
> >Ok, maybe take out the debug for the pte_pfn_to_mfn. And see what
> >the pgd debug gives us.
>
> I ended up just letting it complete on its own, but will remove
> that debug printk() for future tests. Keep in mind that I still have
> "return -ENODEV;" at the beginning of balloon_init() for the kernel used
> during this boot. The full serial console output, minus about 61
> megabytes of the pte_pfn_to_mfn messages trimmed from it, is at:
>
> http://www.pridelands.org/~simba/xen-debug/hailstorm-fullserial20110504.txt

OK, will take a look at this tomorrow. BTW, what machine is this?
Scott Garron
2011-May-04 23:23 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 05/04/2011 06:16 PM, Konrad Rzeszutek Wilk wrote:
> OK, will take a look at this tomorrow. BTW, what machine is this?

The dmidecode output is at:

http://www.pridelands.org/~simba/xen-debug/hailstorm-dmidecode.txt

Here's a description of the components in it and how they're installed:

Tyan Thunder K8S Pro S2882 motherboard (purchased sometime around April, 2004)
BIOS version 3.09
2 AMD Opteron 240 1.4GHz (CPUs - purchased sometime around April, 2004)
4 Mushkin 991143 1GB Memory Modules PPRO3200 ECC REG (4GB RAM - purchased sometime in 2004 or 2005)
Emphase FDM80SQI2G 2GB IDE Flash drive - serial: 20110112AA5B00000041 (boot device - purchased 2011-04-05, installed 2011-05-02)
Western Digital Caviar 500G WD5001AALS Hard Drive
Western Digital Caviar 500G WD5001AALS Hard Drive

Neither the motherboard nor the CPUs are "new" enough to have hardware virtual machine support.

Two of the memory modules are plugged into the slots for CPU0 and two are plugged into the slots for CPU1.

The Emphase flash drive is plugged directly into the Primary IDE/PATA connector on the motherboard and set as Master. It contains GRUB in the MBR, a partition with an ext4 Debian Linux root filesystem, and a small, bootable FAT32 MS-DOS partition for the motherboard BIOS flash utility.

The WD Caviar drives are plugged directly into the SATA ports on the motherboard and are mirrored by the kernel as /dev/md0. That mirror has a partition set as a physical volume for LVM. LVM allocates logical volumes from that space for:

/home
/var
/usr/share
/usr/src

-- 
Scott Garron
Konrad Rzeszutek Wilk
2011-May-05 18:34 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Wed, May 04, 2011 at 07:23:21PM -0400, Scott Garron wrote:
> On 05/04/2011 06:16 PM, Konrad Rzeszutek Wilk wrote:
> >OK, will take a look at this tomorrow. BTW, what machine is this?

> Tyan Thunder K8S Pro S2882 motherboard (purchased sometime around
> April, 2004) BIOS version 3.09

OK, nothing really strange about it.

I don't have any hardware setup that would mirror yours, so I tried to run under QEMU to have a similar E820. I got this:

(XEN) Xen-e820 RAM map:
(XEN) 0000000000000000 - 000000000009f400 (usable)
(XEN) 000000000009f400 - 00000000000a0000 (reserved)
(XEN) 00000000000f0000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 00000000dabfd000 (usable)
(XEN) 00000000dabfd000 - 00000000dac00000 (reserved)
(XEN) 00000000fffc0000 - 0000000100000000 (reserved)
(XEN) System RAM: 3499MB (3583600kB)
.. snip..
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 000000000009f000 (usable)
[ 0.000000] Xen: 000000000009f400 - 0000000000100000 (reserved)
[ 0.000000] Xen: 0000000000100000 - 00000000d0167000 (usable)
[ 0.000000] Xen: 00000000d0167000 - 00000000dabfd000 (unusable)
[ 0.000000] Xen: 00000000dabfd000 - 00000000dac00000 (reserved)
[ 0.000000] Xen: 00000000fec00000 - 00000000fec01000 (reserved)
[ 0.000000] Xen: 00000000fee00000 - 00000000fee01000 (reserved)
[ 0.000000] Xen: 00000000fffc0000 - 0000000100000000 (reserved)
[ 0.000000] Xen: 0000000100000000 - 000000010aa96000 (usable)

which is close enough to be similar to your system. Its E820 stops right at the 4GB mark, and then when Linux boots it considers that whole region as balloon space.

And nothing.. I did not get any of those errors you got. Let me play with this some more.
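For readers comparing the two memory maps, the last_pfn values in Scott's earlier boot log follow directly from his E820 ranges: the highest usable end address shifted right by the 4 KiB page shift. A simplified sketch of that calculation, assuming 4 KiB pages; the types and the function last_usable_pfn are illustrative only and do not reproduce the kernel's own e820 code:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative types; the kernel's own e820 structures differ. */
enum region_type { REGION_USABLE, REGION_OTHER };

struct region {
	uint64_t start;
	uint64_t end;           /* exclusive end address */
	enum region_type type;
};

/*
 * Return the page frame number just past the highest usable byte that
 * lies below 'limit', assuming 4 KiB pages. Passing UINT64_MAX as the
 * limit considers the whole map.
 */
static uint64_t last_usable_pfn(const struct region *map, size_t n,
				uint64_t limit)
{
	uint64_t top = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t end = map[i].end;

		/* Skip non-RAM ranges and ranges entirely above the limit. */
		if (map[i].type != REGION_USABLE || map[i].start >= limit)
			continue;
		if (end > limit)
			end = limit;
		if (end > top)
			top = end;
	}
	return top >> 12;
}
```

Fed the usable ranges from Scott's map (ending at 0x20000000 below 4GB and at 0x1d9ff0000 above it), this gives 0x1d9ff0 for the whole map and 0x20000 with a 4GB limit, the same two last_pfn values that appear in his log.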
Scott Garron
2011-May-05 20:48 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 05/05/2011 02:34 PM, Konrad Rzeszutek Wilk wrote:
> I don't have any hardware setup that would mirror yours

I have an identical set of hardware that I could ship to you. :)

I could also set you up with remote access to the serial console on the current one, if that would help.

-- 
Scott Garron
Konrad Rzeszutek Wilk
2011-May-05 21:06 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Thu, May 05, 2011 at 04:48:36PM -0400, Scott Garron wrote:
> On 05/05/2011 02:34 PM, Konrad Rzeszutek Wilk wrote:
> >I don't have any hardware setup that would mirror yours
>
> I have an identical set of hardware that I could ship to you. :)

<grins>

>
> I could also set you up with remote access to the serial console on
> the current one, if that would help.

Before we go that route, let me ask you to try the latest Linux kernel with the Xen changes:

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/next-2.6.39

The http://wiki.xensource.com/xenwiki/XenParavirtOps page mentions how to download and compile that tree. Look in the 'Downloading the git tree' section.

Let's see how that one works - and if it _does_ work, then it looks like there are some patches that need to be back-ported to the 2.6.32 tree.
Scott Garron
2011-Jun-06 18:00 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 05/05/2011 05:06 PM, Konrad Rzeszutek Wilk wrote:
> Before we go that route, let me ask you to try the latest Linux
> kernel with the Xen changes:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
> devel/next-2.6.39

I finally have had enough time to get around to trying this. The 2.6.39+ kernel from that git branch/repository gets past the "xen/balloon: Initialising balloon driver" part, but freezes at "Trying to unpack root image as initramfs...".

If I boot this kernel without Xen, it boots fine.

Since it's not printing a crash/oops message for this failure, I don't know if that's a different problem or it is somehow related, but either way, the outcome is still a machine that doesn't boot into a current Linux kernel while Xen is under the hood.

-- 
Scott Garron
Pasi Kärkkäinen
2011-Jun-06 19:17 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Mon, Jun 06, 2011 at 02:00:19PM -0400, Scott Garron wrote:
> On 05/05/2011 05:06 PM, Konrad Rzeszutek Wilk wrote:
>> Before we go that route, let me ask you to try the latest Linux
>> kernel with the Xen changes:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
>> devel/next-2.6.39
>
> I finally have had enough time to get around to trying this. The
> 2.6.39+ kernel from that git branch/repository gets past the
> "xen/balloon: Initialising balloon driver" part, but freezes at "Trying
> to unpack root image as initramfs...".
>

Did you try konrad/xen.git xen/stable-2.6.39.x branch?

-- Pasi

> If I boot this kernel without Xen, it boots fine.
>
> Since it's not printing a crash/oops message for this failure, I
> don't know if that's a different problem or it is somehow related, but
> either way, the outcome is still a machine that doesn't boot into a
> current Linux kernel while Xen is under the hood.
>
> --
> Scott Garron
Scott Garron
2011-Jun-06 21:33 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 06/06/2011 03:17 PM, Pasi Kärkkäinen wrote:
> Did you try konrad/xen.git xen/stable-2.6.39.x branch?

I hadn't until you suggested it and did just now. That one crashes in a different spot. The full serial output is here:

http://www.pridelands.org/~simba/xen-debug/hailstorm-fullserial20110606.txt

I've also copied the crash part here:

[ 2.006634] cfg80211: Calling CRDA to update world regulatory domain
[ 2.013339] initcall rfkill_init+0x0/0x97 returned 0 after 6509 usecs
[ 2.016536] calling sysctl_init+0x0/0x48 @ 1
[ 2.016536] initcall sysctl_init+0x0/0x48 returned 0 after 0 usecs
[ 2.025656] ACPI Warning: Large Reference Count (0x1FEA) in object ffff88001eaab398 (20110316/utdelete-448)
[ 2.025656] ACPI Warning: Large Reference Count (0x1FE9) in object ffff88001eaab398 (20110316/utdelete-448)
[ 2.025656] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2.025656] IP: [<ffffffff8105ae4c>] process_one_work+0x27/0x286
[ 2.025656] PGD 0
[ 2.025656] Oops: 0000 [#1] SMP
[ 2.025656] CPU 0
[ 2.025656] Modules linked in:
[ 2.025656]
[ 2.025656] Pid: 374, comm: kworker/0:1 Tainted: G W 2.6.39+ #1 To Be Filled By O.E.M. To Be Filled By O.E.M./TYAN High-End Dual AMD Opteron, S2882
[ 2.025656] RIP: e030:[<ffffffff8105ae4c>] [<ffffffff8105ae4c>] process_one_work+0x27/0x286
[ 2.025656] RSP: e02b:ffff880016c05e40 EFLAGS: 00010046
[ 2.025656] RAX: 0000000000000000 RBX: ffff88001eb19c00 RCX: 0000000000000000
[ 2.025656] RDX: ffff88001eaab390 RSI: ffff88001eaab390 RDI: ffff88001eaab390
[ 2.025656] RBP: ffff880016c05e90 R08: ffff88001f802700 R09: ffffffff812a1ef8
[ 2.025656] R10: ffff880000000000 R11: 0000000000000000 R12: ffff88001fea1040
[ 2.025656] R13: 0000000000000000 R14: ffff88001eb91d80 R15: ffff88001eb19c20
[ 2.025656] FS: 0000000000000000(0000) GS:ffff88001fe92000(0000) knlGS:0000000000000000
[ 2.025656] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2.025656] CR2: 0000000000000000 CR3: 0000000001c03000 CR4: 0000000000000660
[ 2.025656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2.025656] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2.025656] Process kworker/0:1 (pid: 374, threadinfo ffff880016c04000, task ffff88001eb91d80)
[ 2.025656] Stack:
[ 2.025656] ffff88001eb19c20 ffff88001eb91d80 000000001eb19c20 ffff88001eaab390
[ 2.025656] ffff880016c05e90 ffff88001eb19c00 ffff88001fea1040 ffff88001eb19c20
[ 2.025656] ffff88001eb91d80 ffff88001eb19c20 ffff880016c05ee0 ffffffff8105c1d8
[ 2.025656] Call Trace:
[ 2.025656] [<ffffffff8105c1d8>] worker_thread+0xfe/0x182
[ 2.025656] [<ffffffff816114a4>] ? _raw_spin_unlock_irqrestore+0x15/0x18
[ 2.025656] [<ffffffff8105c0da>] ? manage_workers.clone.17+0x16d/0x16d
[ 2.025656] [<ffffffff8105f5bd>] kthread+0x7d/0x85
[ 2.025656] [<ffffffff81618564>] kernel_thread_helper+0x4/0x10
[ 2.025656] [<ffffffff81617973>] ? int_ret_from_sys_call+0x7/0x1b
[ 2.025656] [<ffffffff816117a1>] ? retint_restore_args+0x5/0x6
[ 2.025656] [<ffffffff81618560>] ? gs_change+0x13/0x13
[ 2.025656] Code: 41 5c c9 c3 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 89 fb 48 89 f7 48 83 ec 28 48 89 75 c8 e8 9a e9 ff ff 48 8b 55 c8 49 89 c5 <4c> 8b 20 48 8b 45 c8 48 c1 ea 0b 48 c1 e8 05 48 8d 04 02 49 8b
[ 2.025656] RIP [<ffffffff8105ae4c>] process_one_work+0x27/0x286
[ 2.025656] RSP <ffff880016c05e40>
[ 2.025656] CR2: 0000000000000000
[ 2.025656] ---[ end trace 4eaa2a86a8e2da23 ]---
[ 2.026566] BUG: unable to handle kernel paging request at fffffffffffffff8
[ 2.026575] IP: [<ffffffff8105f80a>] kthread_data+0xb/0x11
[ 2.026585] PGD 1c05067 PUD 1c06067 PMD 0
[ 2.026596] Oops: 0000 [#2] SMP
[ 2.026603] CPU 0
[ 2.026607] Modules linked in:
[ 2.026615]
[ 2.026620] Pid: 374, comm: kworker/0:1 Tainted: G D W 2.6.39+ #1 To Be Filled By O.E.M. To Be Filled By O.E.M./TYAN High-End Dual AMD Opteron, S2882
[ 2.026633] RIP: e030:[<ffffffff8105f80a>] [<ffffffff8105f80a>] kthread_data+0xb/0x11
[ 2.026643] RSP: e02b:ffff880016c05a28 EFLAGS: 00010092
[ 2.026648] RAX: 0000000000000000 RBX: ffff88001fea5080 RCX: 0000000000000000
[ 2.026654] RDX: ffff88001eb92314 RSI: 0000000000000000 RDI: ffff88001eb91d80
[ 2.026659] RBP: ffff880016c05a28 R08: 0000000000100000 R09: 0000000078bd0ee9
[ 2.026665] R10: ffff880016c05a85 R11: ffff880016c05a08 R12: 0000000000000000
[ 2.026670] R13: ffff880016c05b48 R14: 0000000000000000 R15: ffff88001eb92110
[ 2.026679] FS: 0000000000000000(0000) GS:ffff88001fe92000(0000) knlGS:0000000000000000
[ 2.026685] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2.026690] CR2: fffffffffffffff8 CR3: 0000000001c03000 CR4: 0000000000000660
[ 2.026696] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2.026705] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2.026711] Process kworker/0:1 (pid: 374, threadinfo ffff880016c04000, task ffff88001eb91d80)
[ 2.026715] Stack:
[ 2.026719] ffff880016c05a58 ffffffff8105c51a ffff880016c05a58 ffff88001fea5080
[ 2.026732] ffff88001eb92314 ffff880016c05b48 ffff880016c05b18 ffffffff8160fc6e
[ 2.026744] 0000000000000001 00000002b15b581d 0000000000013080 0000000000013080
[ 2.026756] Call Trace:
[ 2.026764] [<ffffffff8105c51a>] wq_worker_sleeping+0x15/0x7d
[ 2.026775] [<ffffffff8160fc6e>] schedule+0x160/0x6b4
[ 2.026790] [<ffffffff81049037>] do_exit+0x785/0x787
[ 2.026799] [<ffffffff81046d60>] ? kmsg_dump+0x46/0xd2
[ 2.026807] [<ffffffff816124e1>] oops_end+0xd7/0xdf
[ 2.026818] [<ffffffff8102f234>] no_context+0x1f4/0x203
[ 2.026829] [<ffffffff812a2207>] ? acpi_os_vprintf+0x2b/0x2d
[ 2.026837] [<ffffffff812a2245>] ? acpi_os_printf+0x3c/0x3e
[ 2.026846] [<ffffffff8102f3d0>] __bad_area_nosemaphore+0x18d/0x1b0
[ 2.026857] [<ffffffff81028c4e>] ? pvclock_clocksource_read+0x48/0xb1
[ 2.026866] [<ffffffff8102f401>] bad_area_nosemaphore+0xe/0x10
[ 2.026877] [<ffffffff816142ea>] do_page_fault+0x18e/0x347
[ 2.026885] [<ffffffff81611c4a>] ? error_exit+0x2a/0x60
[ 2.026893] [<ffffffff816117a1>] ? retint_restore_args+0x5/0x6
[ 2.026902] [<ffffffff812a1ef8>] ? acpi_os_execute_deferred+0x2c/0x31
[ 2.026910] [<ffffffff812a1e4c>] ? acpi_os_write_port+0xe/0x2b
[ 2.026922] [<ffffffff812ba8ae>] ? acpi_hw_write_port+0x3e/0x94
[ 2.026931] [<ffffffff81028c4e>] ? pvclock_clocksource_read+0x48/0xb1
[ 2.026939] [<ffffffff81611a15>] page_fault+0x25/0x30
[ 2.026947] [<ffffffff812a1ef8>] ? acpi_os_execute_deferred+0x2c/0x31
[ 2.026957] [<ffffffff8105ae4c>] ? process_one_work+0x27/0x286
[ 2.026968] [<ffffffff8105c1d8>] worker_thread+0xfe/0x182
[ 2.026976] [<ffffffff816114a4>] ? _raw_spin_unlock_irqrestore+0x15/0x18
[ 2.026985] [<ffffffff8105c0da>] ? manage_workers.clone.17+0x16d/0x16d
[ 2.026993] [<ffffffff8105f5bd>] kthread+0x7d/0x85
[ 2.027001] [<ffffffff81618564>] kernel_thread_helper+0x4/0x10
[ 2.027010] [<ffffffff81617973>] ? int_ret_from_sys_call+0x7/0x1b
[ 2.027019] [<ffffffff816117a1>] ? retint_restore_args+0x5/0x6
[ 2.027027] [<ffffffff81618560>] ? gs_change+0x13/0x13
[ 2.027031] Code: ff 48 c7 c6 18 54 80 81 e8 08 c8 fd ff 48 8b 85 68 ff ff ff 48 81 c4 a0 00 00 00 5b 41 5c c9 c3 48 8b 87 38 03 00 00 55 48 89 e5
[ 2.027119] 8b 40 f8 c9 c3 48 3b 3d a9 93 d5 00 55 48 89 e5 75 09 0f bf
[ 2.027163] RIP [<ffffffff8105f80a>] kthread_data+0xb/0x11
[ 2.027172] RSP <ffff880016c05a28>
[ 2.027176] CR2: fffffffffffffff8
[ 2.027182] ---[ end trace 4eaa2a86a8e2da24 ]---
[ 2.027187] Fixing recursive fault but reboot is needed!

-- 
Scott Garron
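One detail in the second oops may be worth spelling out: the fault address fffffffffffffff8 is what a read 8 bytes below a NULL pointer would touch, which is at least consistent with RAX being zero in that register dump. A small sketch of the arithmetic, assuming 64-bit unsigned wraparound; the function name addr_below is hypothetical and the reading of the oops is an interpretation, not something stated in the log:

```c
#include <stdint.h>

/*
 * Hypothetical helper: address reached by stepping 'offset' bytes below
 * 'base', using 64-bit unsigned wraparound as pointer arithmetic on
 * x86-64 effectively does.
 */
static uint64_t addr_below(uint64_t base, uint64_t offset)
{
	return base - offset;
}
```

With a base of 0 and an offset of 8 this yields 0xfffffffffffffff8, the CR2 value reported by the second oops.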
Konrad Rzeszutek Wilk
2011-Jun-07 19:19 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Mon, Jun 06, 2011 at 05:33:02PM -0400, Scott Garron wrote:
> On 06/06/2011 03:17 PM, Pasi Kärkkäinen wrote:
> >Did you try konrad/xen.git xen/stable-2.6.39.x branch?
>
> I hadn't until you suggested it and did just now. That one crashes
> in a different spot. The full serial output is here:
>
> http://www.pridelands.org/~simba/xen-debug/hailstorm-fullserial20110606.txt
>
> [ 2.026893] [<ffffffff816117a1>] ? retint_restore_args+0x5/0x6
...
> [ 2.026902] [<ffffffff812a1ef8>] ? acpi_os_execute_deferred+0x2c/0x31
> [ 2.026910] [<ffffffff812a1e4c>] ? acpi_os_write_port+0xe/0x2b
> [ 2.026922] [<ffffffff812ba8ae>] ? acpi_hw_write_port+0x3e/0x94
> [ 2.026931] [<ffffffff81028c4e>] ? pvclock_clocksource_read+0x48/0xb1
> [ 2.026939] [<ffffffff81611a15>] page_fault+0x25/0x30
> [ 2.026947] [<ffffffff812a1ef8>] ? acpi_os_execute_deferred+0x2c/0x31

That looks very similar to some previous bugs where the ACPI DSDT was trying to write to the IOAPIC to set pins or such. Hmm, the way we "fixed" it was looking at the disassembled DSDT and removing the offending write. You can look in the archive for ACPI DSDT and look for emails from Jan Beulich - he found the culprit.

But this might be also something else. Did you try to boot with 'acpi=off' just to see if it gets past this? It might hang on something else with that option sadly :-(
Scott Garron
2011-Jun-08 18:25 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 06/07/2011 03:19 PM, Konrad Rzeszutek Wilk wrote:
> But this might be also something else. Did you try to boot with
> 'acpi=off' just to see if it gets past this? It might hang on
> something else with that option sadly :-(

Booting with acpi=off in this branch (xen/stable-2.6.39.x) yields the same results as omitting that option while booting the devel/next-2.6.39 branch: It gets to "Trying to unpack rootfs image as initramfs..." and freezes. The full output of the serial console during that boot (with acpi=off) is here:

http://pridelands.org/~simba/xen-debug/hailstorm-fullserial20110608.txt

-- 
Scott Garron
Konrad Rzeszutek Wilk
2011-Jun-08 19:29 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Wed, Jun 08, 2011 at 02:25:03PM -0400, Scott Garron wrote:
> On 06/07/2011 03:19 PM, Konrad Rzeszutek Wilk wrote:
> > But this might be also something else. Did you try to boot with
> > 'acpi=off' just to see if it gets past this. It might hang on
> > something else with that option sadly :-(
>
> Booting with acpi=off in this branch (xen/stable-2.6.39.x) yields
> the same results as omitting that option while booting the
> devel/next-2.6.39 branch: It gets to "Trying to unpack rootfs image as
> initramfs..." and freezes. The full output of the serial console during

Looking at your output you have this:

(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)

which means that the ACPI IRQ is edge low. Normally a machine (this is one of mine) has this:

(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)

which ends up calling the override with:

xen: sci override: global_irq=9 trigger=0 polarity=0

(trigger 0 is level, polarity 0 is high). On your box it should be something like this:

xen: sci override: global_irq=9 trigger=1 polarity=1

since you have edge and low for the ACPI SCI IRQ. And if that is truly your setup (your ACPI SCI IRQ is routed to pin 0 on the IOAPIC - in other words to IRQ 0), then we bail out of setting it up since we do this check:

    if (!gsi)
        return

and we never call xen_register_gsi, which sets the GSI.. but looking at this:

http://pridelands.org/~simba/xen-debug/hailstorm-fullserial20110606.txt

it looks as if you are registering the ACPI SCI to 9, but there is no interrupt source override to 9 - did you add that in the code yourself?

Can you do the following (without acpi=off, and with 'apic=debug' on your Linux line):

1). Press Ctrl-A a couple of times, then hit '*', and send the output. I am really curious to see what the IOAPIC thinks about the interrupts.

2). Increase the dom0_mem= to say 1G?

3). Try the attached patch (not compile tested)

Do all of those at once.. Let's concentrate on running this with ACPI and see what we get.
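For readers following the override discussion, here is a minimal sketch (not part of the original thread) of checking a captured serial log for an interrupt source override that covers the SCI. It assumes the INT_SRC_OVR line layout quoted in this message, with polarity printed before trigger; the function names and the sample strings are hypothetical and only reuse the lines quoted above.

```python
import re

# Matches lines such as:
#   (XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
# Assumption: the two trailing words are polarity, then trigger.
OVR_RE = re.compile(
    r"INT_SRC_OVR \(bus (\d+) bus_irq (\d+) global_irq (\d+) (\w+) (\w+)\)"
)


def parse_overrides(log_text):
    """Return {bus_irq: (global_irq, polarity, trigger)} found in log_text."""
    overrides = {}
    for match in OVR_RE.finditer(log_text):
        _bus, bus_irq, global_irq, polarity, trigger = match.groups()
        overrides[int(bus_irq)] = (int(global_irq), polarity, trigger)
    return overrides


def sci_override(log_text, sci_irq):
    """Return the override entry for the SCI IRQ, or None if the log has none."""
    return parse_overrides(log_text).get(sci_irq)


# Sample input: the two override lines quoted from the failing box.
FAILING_BOX = """\
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
"""

# Sample input: the lines quoted from the working machine.
WORKING_BOX = """\
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
"""

print(sci_override(FAILING_BOX, 9))  # nothing overrides IRQ 9 here
print(sci_override(WORKING_BOX, 9))
```

Run against a full serial capture instead of the sample strings, the same check shows at a glance whether the firmware supplies an override for the IRQ that the FADT names as the SCI.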
Scott Garron
2011-Jun-09 20:04 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 06/08/2011 03:29 PM, Konrad Rzeszutek Wilk wrote:
> Can you do the following (without acpi=off, and with 'apic=debug' on
> your Linux line):
>
> 1). Press Ctrl-A a couple of times, then hit '*', and send the output.
> I am really curious to see what the IOAPIC thinks about the
> interrupts.
>
> 2). Increase the dom0_mem= to say 1G?

Ok, I've done these 2 things and the output is at:

http://pridelands.org/~simba/xen/hailstorm-fullserial20110609-02.txt

After changing dom0_mem to 1G instead of 512M, the machine hangs at "Switching to clocksource xen" instead of "Trying to unpack rootfs image as initramfs..."

As an added step, I also recompiled the latest Mercurial pull of xen-unstable.hg and am running with that instead of the one I had from a month or so ago. I'm not sure whether that or the apic=debug option is what keeps it from ending up with a kernel OOPS/BUG message.

acpi=off is definitely not on the command line now, though, and it's not giving the OOPS that it was before. It's just hanging/freezing.

> 3). Try the attached patch (not compile tested)

I did not see an attachment on your last e-mail.

--
Scott Garron
Konrad Rzeszutek Wilk
2011-Jun-10 12:59 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Thu, Jun 09, 2011 at 04:04:23PM -0400, Scott Garron wrote:
> On 06/08/2011 03:29 PM, Konrad Rzeszutek Wilk wrote:
> > Can you do the following (without acpi=off, and with 'apic=debug' on
> > your Linux line):
> >
> > 1). Press Ctrl-A a couple of times, then hit '*', and send the output.
> > I am really curious to see what the IOAPIC thinks about the
> > interrupts.
> >
> > 2). Increase the dom0_mem= to say 1G?
>
> Ok, I've done these 2 things and the output is at:
>
> http://pridelands.org/~simba/xen/hailstorm-fullserial20110609-02.txt
>
> After changing dom0_mem to 1G instead of 512M, the machine hangs at
> "Switching to clocksource xen" instead of "Trying to unpack rootfs image
> as initramfs..."
>
> As an added step, I also recompiled the latest Mercurial pull of
> xen-unstable.hg and am running with that instead of the one I had from a
> month or so ago. I'm not sure whether that or the apic=debug option is
> what keeps it from ending up with a kernel OOPS/BUG message.
>
> acpi=off is definitely not on the command line now, though, and
> it's not giving the OOPS that it was before. It's just hanging/freezing.
>
> > 3). Try the attached patch (not compile tested)
>
> I did not see an attachment on your last e-mail.

Uh, try now
Scott Garron
2011-Jun-10 16:51 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 06/10/2011 08:59 AM, Konrad Rzeszutek Wilk wrote:
> Uh, try now

OK, got the patch/attachment this time.

I also forgot to respond to this question in my last reply:

> It looks as if you are registering the ACPI SCI to 9, but there is
> no interrupt source override to 9 - did you add that in the code
> yourself?

I have not made any modifications to the code other than applying the patch that you just gave me.

After applying the patch, I captured the serial console and put the output here (it includes the CTRL-A, asterisk (*) output as well):

http://pridelands.org/~simba/xen/hailstorm-fullserial20110610.txt

--
Scott Garron
Konrad Rzeszutek Wilk
2011-Jun-13 22:03 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On Fri, Jun 10, 2011 at 12:51:05PM -0400, Scott Garron wrote:
> On 06/10/2011 08:59 AM, Konrad Rzeszutek Wilk wrote:
> > Uh, try now
>
> OK, got the patch/attachment this time.

Excellent. I am also changing the title of this thread, since the current one is irrelevant to the topic at hand.

> I also forgot to respond to this question in my last reply:
>
> > It looks as if you are registering the ACPI SCI to 9, but there is
> > no interrupt source override to 9 - did you add that in the code
> > yourself?
>
> I have not made any modifications to the code other than applying
> the patch that you just gave me.

OK

> After applying the patch, I captured the serial console and put the
> output here (it includes the CTRL-A, asterisk (*) output as well):
>
> http://pridelands.org/~simba/xen/hailstorm-fullserial20110610.txt

OK, we are getting somewhere. From what I see you have two interrupt overrides:

(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)

.. but the FADT has the ACPI interrupt listed as interrupt 9:

[ 0.000000] source_irq is 0, global is 2 and FADT is 9

.. well, that would imply that the APIC should be edge, low, but Xen thinks it is level.

Can you do one more thing - boot the same kernel on bare metal? Without any Xen and with the same options .. and also with /proc/interrupts so I can see what native Linux sees?

Thanks.

> --
> Scott Garron
Scott Garron
2011-Jun-13 23:20 UTC
Re: [Xen-devel] BUG: unable to handle kernel paging request - balloon_init - xen-4.1.0 - 2.6.32.39
On 06/13/2011 06:03 PM, Konrad Rzeszutek Wilk wrote:
> Can you do one more thing - boot the same kernel on bare metal?
> Without any Xen and with the same options .. and also with
> /proc/interrupts so I can see what native Linux sees?

The serial console, with the output of cat /proc/interrupts pasted onto the end of it, is here:

http://pridelands.org/~simba/xen/hailstorm-fullserial20110613.txt

--
Scott Garron
Konrad Rzeszutek Wilk
2011-Jun-14 13:55 UTC
[Xen-devel] BUG: unable to handle kernel NULL pointer dereference at IP: [<ffffffff8105ae4c>] process_one_work+
On Mon, Jun 13, 2011 at 07:20:34PM -0400, Scott Garron wrote:
> On 06/13/2011 06:03 PM, Konrad Rzeszutek Wilk wrote:
> > Can you do one more thing - boot the same kernel on bare metal?
> > Without any Xen and with the same options .. and also with
> > /proc/interrupts so I can see what native Linux sees?
>
> The serial console, with the output of cat /proc/interrupts pasted onto
> the end of it, is here:

Thank you.

> http://pridelands.org/~simba/xen/hailstorm-fullserial20110613.txt

So IRQ 9 is correct. Somehow I thought that it and this:

[ 1.646560] dc 0FF ACPI Warning: Large Reference Count (0x1FEA) in object ffff88001ebb3b98 (20110316/utdelete-448)
[ 4.136398] ACPI Warning: Large Reference Count (0x1FE9) in object ffff88001ebb3b98 (20110316/utdelete-448)
[ 4.136426] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 4.136436] IP: [<ffffffff8105ae4c>] process_one_work+0x27/0x286
[ 4.136459] PGD 0
[ 4.136465] Oops: 0000 [#1] SMP
[ 4.136475] CPU 0
[ 4.136479] Modules linked in:
[ 4.136485]
[ 4.136492] Pid: 374, comm: kworker/0:1 Tainted: G W 2.6.39+ #2 To Be Filled By O.E.M. To Be Filled By O.E.M./TYAN High-End Dual AMD Opteron, S2882
[ 4.136505] RIP: e030:[<ffffffff8105ae4c>] [<ffffffff8105ae4c>] process_one_work+0x27/0x286
[ 4.136516] RSP: e02b:ffff88001eb4be40 EFLAGS: 00010046

(from http://pridelands.org/~simba/xen/hailstorm-fullserial20110610.txt) were related - as in the ACPI IRQ gets triggered, it does something (which looks to make the ACPI parser complain), then puts some function on the workqueue, which dies trying to access ffff88001ebb3b80. It died, and whatever that function was supposed to do never completed.

I was thinking that IRQ 9 having the wrong polarity (which it has not) or trigger (which it has not) was causing this mayhem - but that is not the case. Sorry about wasting your time heading down this wrong path.

The boot process continues, the xen clocksource kicks in and does a hypercall .. and is probably looping between the hypercall, the xen upcall handler and back. IRQ 9 is pending, so it hasn't been acknowledged by the Linux kernel. In fact, there are a couple of events that are stuck and are locally masked. Which means that 'spin_lock_irqsave' has been called and has masked the vcpu, but spin_unlock_irqrestore has not been called - which could be due to process_one_work dying.

But the curious thing is that you have two CPUs assigned to Dom0, and while CPU0 looks to be bouncing back and forth, CPU1 is doing something. The RIP is 0xffffffff8108820c. Can you try to run this through System.map? Or the whole bunch of these:

ffffffff8108820c ffffffff81088100 ffffffff810881a7 ffffffff8108811a
ffffffff816101a8 ffffffff81006c32 ffffffff816114a4 ffffffff8108803a
ffffffff8105f5bd ffffffff81618564 ffffffff81617973 ffffffff816117a1
ffffffff81618560

The other idea is to limit Dom0 to only run on one CPU. You can do this by having 'dom0_max_vcpus=1 dom0_vcpus_pin' and see if it fails somewhere else? It probably will die in the 0xffffffff810013aa :-(

But regardless of what I mentioned above, we need to find out why process_one_work got a toxic parameter. Can you disassemble 0xffffffff8105ae4c and see what it does and how it corresponds to 'process_one_work' in kernel/workqueue.c? You can also instrument the code to find out what:

1804 work_func_t f = work->func;

is.

Jeremy, any thoughts on what else might be afoot here?
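The System.map lookup requested above can be sketched as follows (this is not part of the original thread). It assumes the usual three-column System.map layout of address, type and name, and reports the nearest symbol at or below each address. Only the process_one_work base address is derived from the oops (RIP 0xffffffff8105ae4c minus offset 0x27); the other sample entries and all function names are hypothetical.

```python
import bisect


def load_system_map(lines):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, offset) of the last symbol at or below addr, else None."""
    addresses = [a for a, _ in symbols]
    index = bisect.bisect_right(addresses, addr) - 1
    if index < 0:
        return None
    base, name = symbols[index]
    return name, addr - base


def describe(symbols, addr):
    """Format an address the way the oops does, e.g. 'name+0x27'."""
    hit = resolve(symbols, addr)
    if hit is None:
        return f"{addr:#x} (below first symbol)"
    name, offset = hit
    return f"{name}+{offset:#x}"


# Sample data. Only the process_one_work line is derived from the oops;
# the neighbouring entries are hypothetical placeholders.
SAMPLE_MAP = """\
ffffffff8105a000 t hypothetical_previous_symbol
ffffffff8105ae25 t process_one_work
ffffffff8105b0ab t hypothetical_next_symbol
""".splitlines()

symbols = load_system_map(SAMPLE_MAP)
print(describe(symbols, 0xFFFFFFFF8105AE4C))
```

For the real lookup, pass the open System.map file of the running kernel to load_system_map instead of SAMPLE_MAP and loop over the addresses listed in the message.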
Scott Garron
2011-Jun-14 21:55 UTC
[Xen-devel] Re: BUG: unable to handle kernel NULL pointer dereference at IP: [<ffffffff8105ae4c>] process_one_work+
On 06/14/2011 09:55 AM, Konrad Rzeszutek Wilk wrote:
> But the curious thing is that you have two CPUs assigned to Dom0, and
> while CPU0 looks to be bouncing back and forth, CPU1 is doing
> something. The RIP is 0xffffffff8108820c. Can you try to run this
> through System.map? Or the whole bunch of these:
>
> ffffffff8108820c ffffffff81088100 ffffffff810881a7 ffffffff8108811a
> ffffffff816101a8 ffffffff81006c32 ffffffff816114a4 ffffffff8108803a
> ffffffff8105f5bd ffffffff81618564 ffffffff81617973 ffffffff816117a1
> ffffffff81618560

I grabbed code snippets for each of these locations and put them here:

http://pridelands.org/~simba/xen/hailstorm-debugnotes.txt

> The other idea is to limit Dom0 to only run on one CPU. You can do
> this by having 'dom0_max_vcpus=1 dom0_vcpus_pin' and see if it fails
> somewhere else? It probably will die in the 0xffffffff810013aa :-(

After setting dom0_max_vcpus=1 and dom0_vcpus_pin, the boot got to "Trying to unpack rootfs image as initramfs..." and hung there. The serial console as well as the CTRL_A(x3) * outputs are here:

http://pridelands.org/~simba/xen/hailstorm-fullserial20110614.txt

> But regardless of what I mentioned above, we need to find out why
> process_one_work got a toxic parameter. Can you disassemble
> 0xffffffff8105ae4c and see what it does and how it corresponds to
> 'process_one_work' in kernel/workqueue.c?

I put the disassembly of it in the hailstorm-debugnotes.txt file that I mentioned above. Let me know if you need more than that.

> You can also instrument the code to find out what:
>
> 1804 work_func_t f = work->func;
>
> is

I think this request is starting to go a little beyond what I know how to do.

--
Scott Garron