Michael Marineau
2007-Sep-14 22:51 UTC
[Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel
Hey,

I've been beating my head against this bug for the last few days. After dom0's memory is reduced, it appears that something is trying to refer to a page that was removed from the machine_to_phys_mapping table. After much tracing around I haven't spotted how that could happen yet, though.

System required to reproduce:
    x86_32, with or without PAE
    2 GB of RAM or more
    3.1.0's 2.6.18, or things based on it such as Red Hat's 2.6.20 Xen patch
    start dom0 with no memory limit so it uses most of the 2 GB

The easiest way to reproduce the problem is to reduce dom0's memory significantly (to something like 150M), either with mem-set or by starting a very large domU. Then do something: sometimes ls will do, other times I start compiling glibc. It is also possible to hit the issue by reducing memory only a little, but that will take longer to hit, if at all.

I have been unable to reproduce this with 3.0.4's 2.6.16 kernel, but 2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be ok.

I'm guessing this issue is the same as the oops reported here:
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975

Below is an example of the oops on my 2.6.18 PAE kernel with a couple extra debugging lines added:

(XEN) mm.c:503:d0 Could not get page ref for pfn 7fffffff
(XEN) mm.c:2324:d0 mfn: 7fffffff, gmfn: 7fffffff, ptr: 7fffffff0c0
(XEN) mm.c:2325:d0 Could not get page for normal update
virtptr: f57a70c0 machineptr: 7fffffff0c0
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:62!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU: 1
EIP: 0061:[<c0117875>] Not tainted VLI
EFLAGS: 00010296 (2.6.18-xen-r5-try2 #6)
EIP is at xen_l1_entry_update+0xb9/0xde
eax: 0000002d ebx: deadbeef ecx: 00000000 edx: 00000001
esi: deadbeef edi: 00000000 ebp: ecea0c4c esp: ecea0c14
ds: 007b es: 007b ss: 0069
Process bash (pid: 5065, ti=ecea0000 task=ecfe3030 task.ti=ecea0000)
Stack: c037b964 f57a70c0 fffff0c0 000007ff 00000000 00000000 f57a70c0 fffff0c0
       000007ff 00000000 00000000 00000000 00000000 00000000 ecea0cc0 c0158693
       3536f025 00000000 ed383780 ed3837c8 c04bce70 00000000 00000004 00000000
Call Trace:
 [<c0158693>] zap_pte_range+0x265/0x658
 [<c0158bf2>] unmap_page_range+0x16c/0x2b4
 [<c0158e08>] unmap_vmas+0xce/0x1cb
 [<c015f094>] exit_mmap+0x7d/0xf4
 [<c011e0cf>] mmput+0x36/0x8c
 [<c01782af>] exec_mmap+0x156/0x229
 [<c0178a54>] flush_old_exec+0x59/0x25a
 [<c01989f4>] load_elf_binary+0x33c/0xc52
 [<c0178f06>] search_binary_handler+0x89/0x23c
 [<c0197c95>] load_script+0x221/0x23c
 [<c0178f06>] search_binary_handler+0x89/0x23c
 [<c017920b>] do_execve+0x152/0x1be
 [<c010391c>] sys_execve+0x32/0x84
 [<c0104dfb>] syscall_call+0x7/0xb
 [<b7e13899>] 0xb7e13899
Code: 78 08 83 c4 2c 5b 5e 5f 5d c3 8b 45 e4 8b 55 e8 89 54 24 0c 89 44 24 08 8b 45 e
EIP: [<c0117875>] xen_l1_entry_update+0xb9/0xde SS:ESP 0069:ecea0c14

And just for kicks, a non-PAE oops:

(XEN) mm.c:503:d0 Could not get page ref for pfn fffff
(XEN) mm.c:2324:d0 mfn: fffff, gmfn: fffff, ptr: fffff060
(XEN) mm.c:2325:d0 Could not get page for normal update
virtptr: fbfa7060 machineptr: fffff060
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:62!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU: 1
EIP: 0061:[<c01158e1>] Not tainted VLI
EFLAGS: 00010282 (2.6.18-xen-r5-try2 #4)
EIP is at xen_l1_entry_update+0xa1/0xb1
eax: 0000002a ebx: deadbeef ecx: 00000000 edx: 00000001
esi: deadbeef edi: fbfa7060 ebp: c0bcbca0 esp: c0bcbc74
ds: 007b es: 007b ss: 0069
Process bash (pid: 4943, ti=c0bcb000 task=c1fd7030 task.ti=c0bcb000)
Stack: c036508c fbfa7060 fffff060 00000000 fffff060 00000000 00000000 00000000
       fbfa7060 3b875025 f3bce3c0 c0bcbd20 c0152f4b c0bcbd10 f35ff840 80018000
       00000000 f35bb860 c0bcbd38 003fefe8 00000000 00000001 800c9000 f3be7800
Call Trace:
 [<c0152f4b>] unmap_vmas+0x4d4/0x743
 [<c0156b36>] exit_mmap+0x7f/0xf4
 [<c011b779>] mmput+0x24/0x85
 [<c016fd62>] flush_old_exec+0x2de/0xa6d
 [<c018fad0>] load_elf_binary+0x51d/0x1a4d
 [<c016f23e>] search_binary_handler+0x8d/0x22c
 [<c0170eca>] do_execve+0x14d/0x1c9
 [<c01034be>] sys_execve+0x2e/0x76
 [<c0104e83>] syscall_call+0x7/0xb
 [<b7ecb899>] 0xb7ecb899
Code: c1 72 af 0f 0b 22 00 54 29 36 c0 eb a5 8b 45 e4 8b 55 e8 89 44 24 08 89 54 24 0
EIP: [<c01158e1>] xen_l1_entry_update+0xa1/0xb1 SS:ESP 0069:c0bcbc74

The call traces tend to differ, but the above two are pretty common. The oops is in xen_l1_entry_update almost all of the time, though I have also seen it in xen_l2_entry_update.

Thanks,
--
Michael Marineau
Oregon State University
mike@marineau.org

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
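As a quick sanity check on the numbers in the two logs above, the machine pointers can be split into a frame number and a page offset. The sketch below assumes 4 KiB pages (12 offset bits); the helper names are hypothetical and exist only for this illustration, they are not kernel or Xen functions.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                                   /* 4 KiB pages on x86 */
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)

/* Hypothetical helper: frame-number part of a machine address. */
static inline uint64_t toy_maddr_frame(uint64_t maddr)
{
    return maddr >> TOY_PAGE_SHIFT;
}

/* Hypothetical helper: byte offset of an address within its page. */
static inline uint32_t toy_page_offset(uint64_t addr)
{
    return (uint32_t)(addr & TOY_PAGE_MASK);
}

/* Returns 1 when a machine pointer and a virtual pointer share a page offset. */
static inline int toy_same_offset(uint64_t maddr, uint64_t vaddr)
{
    return toy_page_offset(maddr) == toy_page_offset(vaddr);
}
```

Applied to the PAE log, machineptr 7fffffff0c0 splits into frame 7fffffff and offset 0c0; for the non-PAE log, fffff060 splits into frame fffff and offset 060. The frames match the values Xen printed as "pfn", and the offsets match the low 12 bits of the corresponding virtptr values (f57a70c0 and fbfa7060). That suggests the offset was carried over correctly and only the frame number is bogus.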
Keir Fraser
2007-Sep-15 07:07 UTC
Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel
On 14/9/07 23:51, "Michael Marineau" <mike@marineau.org> wrote:

> I have been unable to reproduce this with 3.0.4's 2.6.16 kernel but
> 2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
> ok.
>
> I'm guessing this issue is the same as the oops reported here:
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975
>
> Below is an example of the oops on my 2.6.18 pae kernel with a couple
> extra debugging lines added:

Looks like xen_l1_entry_update() is passed a virtual address which has no corresponding machine address. So the pte page or its mapping is corrupted somehow. deadbeef in the register dumps is also not a good sign. I'll have a go at repro'ing.

 -- Keir
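One way to picture "a virtual address which has no corresponding machine address" is a toy pseudo-physical-to-machine lookup in which the table entry for a page has been invalidated. The sketch below is a user-space model only, not the kernel's code. It assumes an all-ones marker for "no frame" and a top flag bit that is masked off on lookup; both are assumptions that have not been checked against the 2.6.18 Xen tree in this thread, and every name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT     12
#define TOY_PAGE_OFFSET(a) ((a) & 0xfffu)
#define TOY_INVALID_ENTRY  UINT32_C(0xffffffff)  /* assumed "no frame" marker */
#define TOY_FLAG_BIT       UINT32_C(0x80000000)  /* assumed flag bit, masked on lookup */

/* Toy lookup: pseudo-physical frame -> machine frame, flag bit stripped. */
static inline uint32_t toy_pfn_to_mfn(const uint32_t *p2m, uint32_t pfn)
{
    return p2m[pfn] & ~TOY_FLAG_BIT;
}

/* PAE-like case: machine addresses are 64 bits wide. */
static inline uint64_t toy_phys_to_machine64(const uint32_t *p2m, uint32_t phys)
{
    uint64_t mfn = toy_pfn_to_mfn(p2m, phys >> TOY_PAGE_SHIFT);
    return (mfn << TOY_PAGE_SHIFT) | TOY_PAGE_OFFSET(phys);
}

/* Non-PAE-like case: machine addresses are 32 bits, so the shift truncates. */
static inline uint32_t toy_phys_to_machine32(const uint32_t *p2m, uint32_t phys)
{
    uint32_t mfn = toy_pfn_to_mfn(p2m, phys >> TOY_PAGE_SHIFT);
    return (mfn << TOY_PAGE_SHIFT) | TOY_PAGE_OFFSET(phys);
}
```

With a table whose entry 2 is TOY_INVALID_ENTRY, translating 0x20c0 in the 64-bit variant gives 0x7fffffff0c0 and translating 0x2060 in the 32-bit variant gives 0xfffff060. These happen to have the same shape as the machineptr values in the PAE and non-PAE logs, which is at least consistent with an invalidated mapping entry being used for the translation. It does not show how a still-mapped pte page came to have such an entry.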
Michael Marineau
2007-Sep-17 23:56 UTC
Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel
On 9/15/07, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:
> On 14/9/07 23:51, "Michael Marineau" <mike@marineau.org> wrote:
>
> > I have been unable to reproduce this with 3.0.4's 2.6.16 kernel but
> > 2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
> > ok.
> >
> > I'm guessing this issue is the same as the oops reported here:
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975
> >
> > Below is an example of the oops on my 2.6.18 pae kernel with a couple
> > extra debugging lines added:
>
> Looks like xen_l1_entry_update() is passed a virtual address which has no
> corresponding machine address. So the pte page or its mapping is corrupted
> somehow. deadbeef in the register dumps is also not a good sign. I'll have a
> go at repro'ing.
>
>  -- Keir

As for the deadbeef, I kind of doubt it is important. Those values show up after the hypercall to Xen. Using the attached patch, which checks for the bogus value prior to the call, I get the following oops:

virtptr: f57b40c0 machineptr: 7fffffff0c0
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:64!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU: 0
EIP: 0061:[<c0117893>] Not tainted VLI
EFLAGS: 00010286 (2.6.18-xen-r5-try2 #10)
EIP is at xen_l1_entry_update+0xd7/0x100
eax: 0000002d ebx: 00000000 ecx: 00000000 edx: 00000001
esi: fffff0c0 edi: 000007ff ebp: ed45cd10 esp: ed45ccd8
ds: 007b es: 007b ss: 0069
Process bash (pid: 5044, ti=ed45c000 task=ec835a70 task.ti=ed45c000)
Stack: c037b964 f57b40c0 fffff0c0 000007ff 00000000 00000000 f57b40c0 fffff0c0
       000007ff 00000000 00000000 00000000 00000000 00000000 ed45cd84 c01586b7
       35371025 00000000 ecd95ec0 ecd95f08 c04bce70 00000000 00000004 00000000
Call Trace:
 [<c01586b7>] zap_pte_range+0x265/0x658
 [<c0158c16>] unmap_page_range+0x16c/0x2b4
 [<c0158e2c>] unmap_vmas+0xce/0x1cb
 [<c015f0b8>] exit_mmap+0x7d/0xf4
 [<c011e0f3>] mmput+0x36/0x8c
 [<c01782d3>] exec_mmap+0x156/0x229
 [<c0178a78>] flush_old_exec+0x59/0x25a
 [<c0198a18>] load_elf_binary+0x33c/0xc52
 [<c0178f2a>] search_binary_handler+0x89/0x23c
 [<c017922f>] do_execve+0x152/0x1be
 [<c010391c>] sys_execve+0x32/0x84
 [<c0104dfb>] syscall_call+0x7/0xb
 [<b7efd899>] 0xb7efd899
Code: b4 97 fe ff 85 c0 78 42 83 c4 2c 5b 5e 5f 5d c3 8b 45 e0 89 74 24 08 89 7c 24 0
EIP: [<c0117893>] xen_l1_entry_update+0xd7/0x100 SS:ESP 0069:ed45ccd8

--
Michael Marineau
Oregon State University
mike@marineau.org
Michael Marineau
2007-Oct-03 20:39 UTC
Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel
On 9/17/07, Michael Marineau <mike@marineau.org> wrote:
> On 9/15/07, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:
> > On 14/9/07 23:51, "Michael Marineau" <mike@marineau.org> wrote:
> >
> > > I have been unable to reproduce this with 3.0.4's 2.6.16 kernel but
> > > 2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
> > > ok.
> > >
> > > I'm guessing this issue is the same as the oops reported here:
> > > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975
> > >
> > > Below is an example of the oops on my 2.6.18 pae kernel with a couple
> > > extra debugging lines added:
> >
> > Looks like xen_l1_entry_update() is passed a virtual address which has no
> > corresponding machine address. So the pte page or its mapping is corrupted
> > somehow. deadbeef in the register dumps is also not a good sign. I'll have a
> > go at repro'ing.
> >
> >  -- Keir
>
> As for the deadbeef, I kind of doubt it is important. Those values
> show up after the hypercall to xen. Using the attached patch which
> checks for the bogus value prior to the call I get the following oops:
>
> virtptr: f57b40c0 machineptr: 7fffffff0c0
> ------------[ cut here ]------------
> kernel BUG at arch/i386/mm/hypervisor.c:64!
> invalid opcode: 0000 [#1]
> SMP
> Modules linked in:
> CPU: 0
> EIP: 0061:[<c0117893>] Not tainted VLI
> EFLAGS: 00010286 (2.6.18-xen-r5-try2 #10)
> EIP is at xen_l1_entry_update+0xd7/0x100
> eax: 0000002d ebx: 00000000 ecx: 00000000 edx: 00000001
> esi: fffff0c0 edi: 000007ff ebp: ed45cd10 esp: ed45ccd8
> ds: 007b es: 007b ss: 0069
> Process bash (pid: 5044, ti=ed45c000 task=ec835a70 task.ti=ed45c000)
> Stack: c037b964 f57b40c0 fffff0c0 000007ff 00000000 00000000 f57b40c0 fffff0c0
>        000007ff 00000000 00000000 00000000 00000000 00000000 ed45cd84 c01586b7
>        35371025 00000000 ecd95ec0 ecd95f08 c04bce70 00000000 00000004 00000000
> Call Trace:
>  [<c01586b7>] zap_pte_range+0x265/0x658
>  [<c0158c16>] unmap_page_range+0x16c/0x2b4
>  [<c0158e2c>] unmap_vmas+0xce/0x1cb
>  [<c015f0b8>] exit_mmap+0x7d/0xf4
>  [<c011e0f3>] mmput+0x36/0x8c
>  [<c01782d3>] exec_mmap+0x156/0x229
>  [<c0178a78>] flush_old_exec+0x59/0x25a
>  [<c0198a18>] load_elf_binary+0x33c/0xc52
>  [<c0178f2a>] search_binary_handler+0x89/0x23c
>  [<c017922f>] do_execve+0x152/0x1be
>  [<c010391c>] sys_execve+0x32/0x84
>  [<c0104dfb>] syscall_call+0x7/0xb
>  [<b7efd899>] 0xb7efd899
> Code: b4 97 fe ff 85 c0 78 42 83 c4 2c 5b 5e 5f 5d c3 8b 45 e0 89 74
> 24 08 89 7c 24 0
> EIP: [<c0117893>] xen_l1_entry_update+0xd7/0x100 SS:ESP 0069:ed45ccd8

I can still reproduce this problem on the 3.1.1-rc2 Xen kernel. Has anyone had a chance to take a look at this or try to reproduce it? I can reproduce this far too easily :-(

Is there any further debugging information I can provide?

--
Michael Marineau
Oregon State University
mike@marineau.org
Keir Fraser
2007-Oct-04 09:35 UTC
Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel
On 3/10/07 21:39, "Michael Marineau" <mike@marineau.org> wrote:

>> Code: b4 97 fe ff 85 c0 78 42 83 c4 2c 5b 5e 5f 5d c3 8b 45 e0 89 74
>> 24 08 89 7c 24 0
>> EIP: [<c0117893>] xen_l1_entry_update+0xd7/0x100 SS:ESP 0069:ed45ccd8
>
> I can still reproduce this problem on the 3.1.1-rc2 xen kernel. Has
> anyone had a chance to take a look at this or try to reproduce it? I
> can reproduce this far too easily :-(

I'll need to find a test box with more than 2GB of memory...

 -- Keir