Michael Marineau
2007-Sep-14 22:51 UTC
[Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel
Hey,

I've been beating my head against this bug for the last few days.
After Dom0's memory is reduced, it appears that something tries to
refer to a page that was removed from the machine_to_phys_mapping
table. After much tracing around, I still haven't spotted how that
could happen.

System required to reproduce:
x86_32, with or without PAE
2 GB of RAM or more
3.1.0's 2.6.18 kernel, or things based on it such as Red Hat's 2.6.20 Xen patch
Dom0 started with no memory limit, so it uses most of the 2 GB

The easiest way to reproduce the problem is to reduce Dom0's memory
significantly (to something like 150M), either with xm mem-set or by
starting a very large domU. Then do something: sometimes ls will do,
other times I start compiling glibc. It is also possible to hit the
issue by reducing memory only a little, but then it takes longer to
hit, if it is hit at all.

I have been unable to reproduce this with 3.0.4's 2.6.16 kernel, but
2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
ok.

I'm guessing this issue is the same as the oops reported here:
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975

Below is an example of the oops on my 2.6.18 PAE kernel with a couple
of extra debugging lines added:
(XEN) mm.c:503:d0 Could not get page ref for pfn 7fffffff
(XEN) mm.c:2324:d0 mfn: 7fffffff, gmfn: 7fffffff, ptr: 7fffffff0c0
(XEN) mm.c:2325:d0 Could not get page for normal update
virtptr: f57a70c0 machineptr: 7fffffff0c0
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:62!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU: 1
EIP: 0061:[<c0117875>] Not tainted VLI
EFLAGS: 00010296 (2.6.18-xen-r5-try2 #6)
EIP is at xen_l1_entry_update+0xb9/0xde
eax: 0000002d ebx: deadbeef ecx: 00000000 edx: 00000001
esi: deadbeef edi: 00000000 ebp: ecea0c4c esp: ecea0c14
ds: 007b es: 007b ss: 0069
Process bash (pid: 5065, ti=ecea0000 task=ecfe3030 task.ti=ecea0000)
Stack: c037b964 f57a70c0 fffff0c0 000007ff 00000000 00000000 f57a70c0 fffff0c0
000007ff 00000000 00000000 00000000 00000000 00000000 ecea0cc0 c0158693
3536f025 00000000 ed383780 ed3837c8 c04bce70 00000000 00000004 00000000
Call Trace:
[<c0158693>] zap_pte_range+0x265/0x658
[<c0158bf2>] unmap_page_range+0x16c/0x2b4
[<c0158e08>] unmap_vmas+0xce/0x1cb
[<c015f094>] exit_mmap+0x7d/0xf4
[<c011e0cf>] mmput+0x36/0x8c
[<c01782af>] exec_mmap+0x156/0x229
[<c0178a54>] flush_old_exec+0x59/0x25a
[<c01989f4>] load_elf_binary+0x33c/0xc52
[<c0178f06>] search_binary_handler+0x89/0x23c
[<c0197c95>] load_script+0x221/0x23c
[<c0178f06>] search_binary_handler+0x89/0x23c
[<c017920b>] do_execve+0x152/0x1be
[<c010391c>] sys_execve+0x32/0x84
[<c0104dfb>] syscall_call+0x7/0xb
[<b7e13899>] 0xb7e13899
Code: 78 08 83 c4 2c 5b 5e 5f 5d c3 8b 45 e4 8b 55 e8 89 54 24 0c 89
44 24 08 8b 45 e
EIP: [<c0117875>] xen_l1_entry_update+0xb9/0xde SS:ESP 0069:ecea0c14
And just for kicks, a non-PAE oops:
(XEN) mm.c:503:d0 Could not get page ref for pfn fffff
(XEN) mm.c:2324:d0 mfn: fffff, gmfn: fffff, ptr: fffff060
(XEN) mm.c:2325:d0 Could not get page for normal update
virtptr: fbfa7060 machineptr: fffff060
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:62!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU: 1
EIP: 0061:[<c01158e1>] Not tainted VLI
EFLAGS: 00010282 (2.6.18-xen-r5-try2 #4)
EIP is at xen_l1_entry_update+0xa1/0xb1
eax: 0000002a ebx: deadbeef ecx: 00000000 edx: 00000001
esi: deadbeef edi: fbfa7060 ebp: c0bcbca0 esp: c0bcbc74
ds: 007b es: 007b ss: 0069
Process bash (pid: 4943, ti=c0bcb000 task=c1fd7030 task.ti=c0bcb000)
Stack: c036508c fbfa7060 fffff060 00000000 fffff060 00000000 00000000 00000000
fbfa7060 3b875025 f3bce3c0 c0bcbd20 c0152f4b c0bcbd10 f35ff840 80018000
00000000 f35bb860 c0bcbd38 003fefe8 00000000 00000001 800c9000 f3be7800
Call Trace:
[<c0152f4b>] unmap_vmas+0x4d4/0x743
[<c0156b36>] exit_mmap+0x7f/0xf4
[<c011b779>] mmput+0x24/0x85
[<c016fd62>] flush_old_exec+0x2de/0xa6d
[<c018fad0>] load_elf_binary+0x51d/0x1a4d
[<c016f23e>] search_binary_handler+0x8d/0x22c
[<c0170eca>] do_execve+0x14d/0x1c9
[<c01034be>] sys_execve+0x2e/0x76
[<c0104e83>] syscall_call+0x7/0xb
[<b7ecb899>] 0xb7ecb899
Code: c1 72 af 0f 0b 22 00 54 29 36 c0 eb a5 8b 45 e4 8b 55 e8 89 44
24 08 89 54 24 0
EIP: [<c01158e1>] xen_l1_entry_update+0xa1/0xb1 SS:ESP 0069:c0bcbc74
The call traces tend to differ, but the above two are pretty common.
The oops is in xen_l1_entry_update almost all of the time, though I
have also seen it in xen_l2_entry_update.
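
The numbers in the two "virtptr/machineptr" debug lines can be cross-checked with a little arithmetic. The sketch below assumes 4 KiB pages (a 12-bit page offset) and uses two hypothetical helper names, split_machine_ptr and machine_ptr, that do not come from the kernel source. It splits each reported machine pointer into a frame number and a page offset, and shows that the non-PAE value is at least consistent with the same frame number 0x7fffffff being shifted into a 32-bit machine address. Any frame number whose low 20 bits are all ones would give the same non-PAE result, so this is a consistency check, not a diagnosis.

```python
PAGE_SHIFT = 12                       # assumption: 4 KiB pages on x86
PAGE_MASK = (1 << PAGE_SHIFT) - 1     # low 12 bits: offset within the page


def split_machine_ptr(ptr):
    """Split a machine address into (frame number, offset within the page)."""
    return ptr >> PAGE_SHIFT, ptr & PAGE_MASK


def machine_ptr(mfn, virt, addr_bits):
    """Combine a frame number with the page offset of virt.

    The result is truncated to addr_bits to model the width of the
    variable holding the machine address (hypothetical helper).
    """
    full = (mfn << PAGE_SHIFT) | (virt & PAGE_MASK)
    return full & ((1 << addr_bits) - 1)


# Values copied from the two oopses in this message.
pae_mfn, pae_off = split_machine_ptr(0x7fffffff0c0)
nonpae_mfn, nonpae_off = split_machine_ptr(0xfffff060)

print("PAE:     mfn=%x offset=%03x" % (pae_mfn, pae_off))
print("non-PAE: mfn=%x offset=%03x" % (nonpae_mfn, nonpae_off))

# The page offsets match the low 12 bits of the reported virtptr values.
print("offsets match virtptr:",
      pae_off == (0xf57a70c0 & PAGE_MASK),
      nonpae_off == (0xfbfa7060 & PAGE_MASK))

# Frame 0x7fffffff in a 64-bit machine address (PAE case) ...
print("64-bit: %x" % machine_ptr(0x7fffffff, 0xf57a70c0, 64))
# ... and the same frame truncated to 32 bits (non-PAE case).
print("32-bit: %x" % machine_ptr(0x7fffffff, 0xfbfa7060, 32))
```

In both builds the page offset is carried over from the virtual address unchanged, so only the frame number is bogus.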
Thanks,
--
Michael Marineau
Oregon State University
mike@marineau.org
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Keir Fraser
2007-Sep-15 07:07 UTC
Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel
On 14/9/07 23:51, "Michael Marineau" <mike@marineau.org> wrote:

> I have been unable to reproduce this with 3.0.4's 2.6.16 kernel but
> 2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
> ok.
>
> I'm guessing this issue is the same as the oops reported here:
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975
>
> Below is an example of the oops on my 2.6.18 pae kernel with a couple
> extra debugging lines added:

Looks like xen_l1_entry_update() is passed a virtual address which has no
corresponding machine address. So the pte page or its mapping is corrupted
somehow. deadbeef in the register dumps is also not a good sign. I'll have a
go at repro'ing.

 -- Keir
Michael Marineau
2007-Sep-17 23:56 UTC
Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel
On 9/15/07, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:

> On 14/9/07 23:51, "Michael Marineau" <mike@marineau.org> wrote:
>
> > I have been unable to reproduce this with 3.0.4's 2.6.16 kernel but
> > 2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
> > ok.
> >
> > I'm guessing this issue is the same as the oops reported here:
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975
> >
> > Below is an example of the oops on my 2.6.18 pae kernel with a couple
> > extra debugging lines added:
>
> Looks like xen_l1_entry_update() is passed a virtual address which has no
> corresponding machine address. So the pte page or its mapping is corrupted
> somehow. deadbeef in the register dumps is also not a good sign. I'll have a
> go at repro'ing.
>
>  -- Keir

As for the deadbeef, I kind of doubt it is important. Those values
show up after the hypercall to Xen. Using the attached patch, which
checks for the bogus value prior to the call, I get the following oops:

virtptr: f57b40c0 machineptr: 7fffffff0c0
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:64!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU: 0
EIP: 0061:[<c0117893>] Not tainted VLI
EFLAGS: 00010286 (2.6.18-xen-r5-try2 #10)
EIP is at xen_l1_entry_update+0xd7/0x100
eax: 0000002d ebx: 00000000 ecx: 00000000 edx: 00000001
esi: fffff0c0 edi: 000007ff ebp: ed45cd10 esp: ed45ccd8
ds: 007b es: 007b ss: 0069
Process bash (pid: 5044, ti=ed45c000 task=ec835a70 task.ti=ed45c000)
Stack: c037b964 f57b40c0 fffff0c0 000007ff 00000000 00000000 f57b40c0 fffff0c0
000007ff 00000000 00000000 00000000 00000000 00000000 ed45cd84 c01586b7
35371025 00000000 ecd95ec0 ecd95f08 c04bce70 00000000 00000004 00000000
Call Trace:
[<c01586b7>] zap_pte_range+0x265/0x658
[<c0158c16>] unmap_page_range+0x16c/0x2b4
[<c0158e2c>] unmap_vmas+0xce/0x1cb
[<c015f0b8>] exit_mmap+0x7d/0xf4
[<c011e0f3>] mmput+0x36/0x8c
[<c01782d3>] exec_mmap+0x156/0x229
[<c0178a78>] flush_old_exec+0x59/0x25a
[<c0198a18>] load_elf_binary+0x33c/0xc52
[<c0178f2a>] search_binary_handler+0x89/0x23c
[<c017922f>] do_execve+0x152/0x1be
[<c010391c>] sys_execve+0x32/0x84
[<c0104dfb>] syscall_call+0x7/0xb
[<b7efd899>] 0xb7efd899
Code: b4 97 fe ff 85 c0 78 42 83 c4 2c 5b 5e 5f 5d c3 8b 45 e0 89 74
24 08 89 7c 24 0
EIP: [<c0117893>] xen_l1_entry_update+0xd7/0x100 SS:ESP 0069:ed45ccd8

--
Michael Marineau
Oregon State University
mike@marineau.org
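
The idea of checking for the bogus value before issuing the hypercall can be modelled outside the kernel. The following is a minimal sketch under stated assumptions, not the attached patch: the names checked_l1_entry_update, virt_to_machine, BOGUS_MFN and BogusMachineAddress are hypothetical, the physical-to-machine table is a plain dictionary, and virtual addresses are assumed to be lowmem addresses direct-mapped at 0xC0000000. Treating 0x7fffffff as a sentinel is also an assumption taken from the value seen in the oops output.

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1
PAGE_OFFSET = 0xC0000000             # assumption: default i386 lowmem direct map
BOGUS_MFN = 0x7fffffff               # frame number seen in the oops (assumed sentinel)


class BogusMachineAddress(Exception):
    """Raised instead of handing a bogus machine address to the hypervisor."""


def virt_to_machine(virt, p2m):
    """Translate a direct-mapped virtual address through a toy p2m table."""
    pfn = (virt - PAGE_OFFSET) >> PAGE_SHIFT
    mfn = p2m[pfn]
    return (mfn << PAGE_SHIFT) | (virt & PAGE_MASK)


def checked_l1_entry_update(virt, p2m, hypercall):
    """Refuse to issue the update if the translated frame number is bogus."""
    maddr = virt_to_machine(virt, p2m)
    if (maddr >> PAGE_SHIFT) == BOGUS_MFN:
        # Same format as the extra debug line in the oops output.
        raise BogusMachineAddress(
            "virtptr: %08x machineptr: %x" % (virt, maddr))
    return hypercall(maddr)


if __name__ == "__main__":
    issued = []                                   # stand-in for the real hypercall
    p2m = {0x357b4: BOGUS_MFN, 0x100: 0x2345}     # made-up table contents
    checked_l1_entry_update(0xC0100010, p2m, issued.append)
    try:
        checked_l1_entry_update(0xF57B40C0, p2m, issued.append)
    except BogusMachineAddress as err:
        print(err)
    print([hex(m) for m in issued])
```

Failing before the call keeps the caller's register state intact at the point of failure, which fits the observation that the deadbeef values only show up after the hypercall.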
Michael Marineau
2007-Oct-03 20:39 UTC
Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel
On 9/17/07, Michael Marineau <mike@marineau.org> wrote:

> On 9/15/07, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:
> > On 14/9/07 23:51, "Michael Marineau" <mike@marineau.org> wrote:
> >
> > > I have been unable to reproduce this with 3.0.4's 2.6.16 kernel but
> > > 2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
> > > ok.
> > >
> > > I'm guessing this issue is the same as the oops reported here:
> > > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975
> > >
> > > Below is an example of the oops on my 2.6.18 pae kernel with a couple
> > > extra debugging lines added:
> >
> > Looks like xen_l1_entry_update() is passed a virtual address which has no
> > corresponding machine address. So the pte page or its mapping is corrupted
> > somehow. deadbeef in the register dumps is also not a good sign. I'll have a
> > go at repro'ing.
> >
> >  -- Keir
>
> As for the deadbeef, I kind of doubt it is important. Those values
> show up after the hypercall to Xen. Using the attached patch, which
> checks for the bogus value prior to the call, I get the following oops:
>
> virtptr: f57b40c0 machineptr: 7fffffff0c0
> ------------[ cut here ]------------
> kernel BUG at arch/i386/mm/hypervisor.c:64!
> invalid opcode: 0000 [#1]
> SMP
> Modules linked in:
> CPU: 0
> EIP: 0061:[<c0117893>] Not tainted VLI
> EFLAGS: 00010286 (2.6.18-xen-r5-try2 #10)
> EIP is at xen_l1_entry_update+0xd7/0x100
> eax: 0000002d ebx: 00000000 ecx: 00000000 edx: 00000001
> esi: fffff0c0 edi: 000007ff ebp: ed45cd10 esp: ed45ccd8
> ds: 007b es: 007b ss: 0069
> Process bash (pid: 5044, ti=ed45c000 task=ec835a70 task.ti=ed45c000)
> Stack: c037b964 f57b40c0 fffff0c0 000007ff 00000000 00000000 f57b40c0 fffff0c0
> 000007ff 00000000 00000000 00000000 00000000 00000000 ed45cd84 c01586b7
> 35371025 00000000 ecd95ec0 ecd95f08 c04bce70 00000000 00000004 00000000
> Call Trace:
> [<c01586b7>] zap_pte_range+0x265/0x658
> [<c0158c16>] unmap_page_range+0x16c/0x2b4
> [<c0158e2c>] unmap_vmas+0xce/0x1cb
> [<c015f0b8>] exit_mmap+0x7d/0xf4
> [<c011e0f3>] mmput+0x36/0x8c
> [<c01782d3>] exec_mmap+0x156/0x229
> [<c0178a78>] flush_old_exec+0x59/0x25a
> [<c0198a18>] load_elf_binary+0x33c/0xc52
> [<c0178f2a>] search_binary_handler+0x89/0x23c
> [<c017922f>] do_execve+0x152/0x1be
> [<c010391c>] sys_execve+0x32/0x84
> [<c0104dfb>] syscall_call+0x7/0xb
> [<b7efd899>] 0xb7efd899
> Code: b4 97 fe ff 85 c0 78 42 83 c4 2c 5b 5e 5f 5d c3 8b 45 e0 89 74
> 24 08 89 7c 24 0
> EIP: [<c0117893>] xen_l1_entry_update+0xd7/0x100 SS:ESP 0069:ed45ccd8

I can still reproduce this problem on the 3.1.1-rc2 Xen kernel. Has
anyone had a chance to take a look at this or try to reproduce it? I
can reproduce this far too easily :-(

Is there any further debugging information I can provide?

--
Michael Marineau
Oregon State University
mike@marineau.org
Keir Fraser
2007-Oct-04 09:35 UTC
Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel
On 3/10/07 21:39, "Michael Marineau" <mike@marineau.org> wrote:

>> Code: b4 97 fe ff 85 c0 78 42 83 c4 2c 5b 5e 5f 5d c3 8b 45 e0 89 74
>> 24 08 89 7c 24 0
>> EIP: [<c0117893>] xen_l1_entry_update+0xd7/0x100 SS:ESP 0069:ed45ccd8
>
> I can still reproduce this problem on the 3.1.1-rc2 xen kernel. Has
> anyone had a chance to take a look at this or try to reproduce it? I
> can reproduce this far too easily :-(

I'll need to find a test box with more than 2GB of memory...

 -- Keir