thr3ads.net - Xen devel - [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel [Sep 2007]

Michael Marineau

2007-Sep-14 22:51 UTC

[Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel

Hey,
I've been beating my head against this bug for the last few days.
After Dom0's memory is reduced, it appears that something tries to
refer to a page that was removed from the machine_to_phys_mapping
table. After much tracing around, I still haven't spotted how that
could happen.

System required to reproduce:
x86_32, with or without pae
2 GB of RAM or more
3.1.0's 2.6.18 kernel, or things based on it such as Red Hat's 2.6.20 Xen patch
dom0 started with no memory limit, so that it uses most of the 2 GB

The easiest way to reproduce the problem is to reduce dom0's memory
significantly (to something like 150M), either with mem-set or by
starting a very large domU. Then do something: sometimes ls will do,
other times I start compiling glibc. It is also possible to hit the
issue by reducing memory only a little, but that takes longer to
trigger, if it triggers at all.

I have been unable to reproduce this with 3.0.4's 2.6.16 kernel, but
2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
ok.

I'm guessing this issue is the same as the oops reported here:
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975

Below is an example of the oops on my 2.6.18 pae kernel with a couple
extra debugging lines added:

(XEN) mm.c:503:d0 Could not get page ref for pfn 7fffffff
(XEN) mm.c:2324:d0 mfn: 7fffffff, gmfn: 7fffffff, ptr: 7fffffff0c0
(XEN) mm.c:2325:d0 Could not get page for normal update
virtptr: f57a70c0 machineptr: 7fffffff0c0
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:62!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU:    1
EIP:    0061:[<c0117875>]    Not tainted VLI
EFLAGS: 00010296   (2.6.18-xen-r5-try2 #6)
EIP is at xen_l1_entry_update+0xb9/0xde
eax: 0000002d   ebx: deadbeef   ecx: 00000000   edx: 00000001
esi: deadbeef   edi: 00000000   ebp: ecea0c4c   esp: ecea0c14
ds: 007b   es: 007b   ss: 0069
Process bash (pid: 5065, ti=ecea0000 task=ecfe3030 task.ti=ecea0000)
Stack: c037b964 f57a70c0 fffff0c0 000007ff 00000000 00000000 f57a70c0 fffff0c0
       000007ff 00000000 00000000 00000000 00000000 00000000 ecea0cc0 c0158693
       3536f025 00000000 ed383780 ed3837c8 c04bce70 00000000 00000004 00000000
Call Trace:
 [<c0158693>] zap_pte_range+0x265/0x658
 [<c0158bf2>] unmap_page_range+0x16c/0x2b4
 [<c0158e08>] unmap_vmas+0xce/0x1cb
 [<c015f094>] exit_mmap+0x7d/0xf4
 [<c011e0cf>] mmput+0x36/0x8c
 [<c01782af>] exec_mmap+0x156/0x229
 [<c0178a54>] flush_old_exec+0x59/0x25a
 [<c01989f4>] load_elf_binary+0x33c/0xc52
 [<c0178f06>] search_binary_handler+0x89/0x23c
 [<c0197c95>] load_script+0x221/0x23c
 [<c0178f06>] search_binary_handler+0x89/0x23c
 [<c017920b>] do_execve+0x152/0x1be
 [<c010391c>] sys_execve+0x32/0x84
 [<c0104dfb>] syscall_call+0x7/0xb
 [<b7e13899>] 0xb7e13899
Code: 78 08 83 c4 2c 5b 5e 5f 5d c3 8b 45 e4 8b 55 e8 89 54 24 0c 89
44 24 08 8b 45 e
EIP: [<c0117875>] xen_l1_entry_update+0xb9/0xde SS:ESP 0069:ecea0c14

And just for kicks, a non-pae oops:

(XEN) mm.c:503:d0 Could not get page ref for pfn fffff
(XEN) mm.c:2324:d0 mfn: fffff, gmfn: fffff, ptr: fffff060
(XEN) mm.c:2325:d0 Could not get page for normal update
virtptr: fbfa7060 machineptr: fffff060
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:62!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU:    1
EIP:    0061:[<c01158e1>]    Not tainted VLI
EFLAGS: 00010282   (2.6.18-xen-r5-try2 #4)
EIP is at xen_l1_entry_update+0xa1/0xb1
eax: 0000002a   ebx: deadbeef   ecx: 00000000   edx: 00000001
esi: deadbeef   edi: fbfa7060   ebp: c0bcbca0   esp: c0bcbc74
ds: 007b   es: 007b   ss: 0069
Process bash (pid: 4943, ti=c0bcb000 task=c1fd7030 task.ti=c0bcb000)
Stack: c036508c fbfa7060 fffff060 00000000 fffff060 00000000 00000000 00000000
       fbfa7060 3b875025 f3bce3c0 c0bcbd20 c0152f4b c0bcbd10 f35ff840 80018000
       00000000 f35bb860 c0bcbd38 003fefe8 00000000 00000001 800c9000 f3be7800
Call Trace:
 [<c0152f4b>] unmap_vmas+0x4d4/0x743
 [<c0156b36>] exit_mmap+0x7f/0xf4
 [<c011b779>] mmput+0x24/0x85
 [<c016fd62>] flush_old_exec+0x2de/0xa6d
 [<c018fad0>] load_elf_binary+0x51d/0x1a4d
 [<c016f23e>] search_binary_handler+0x8d/0x22c
 [<c0170eca>] do_execve+0x14d/0x1c9
 [<c01034be>] sys_execve+0x2e/0x76
 [<c0104e83>] syscall_call+0x7/0xb
 [<b7ecb899>] 0xb7ecb899
Code: c1 72 af 0f 0b 22 00 54 29 36 c0 eb a5 8b 45 e4 8b 55 e8 89 44
24 08 89 54 24 0
EIP: [<c01158e1>] xen_l1_entry_update+0xa1/0xb1 SS:ESP 0069:c0bcbc74

The call traces tend to differ, but the above two are pretty common.
The oops is in xen_l1_entry_update almost all of the time, although I
have also seen it in xen_l2_entry_update.
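
For anyone decoding the debug lines in the two oopses: each machineptr splits cleanly into a frame number and a byte offset within the page, assuming 4 KiB pages (a 12-bit shift). Below is a minimal Python sketch of that arithmetic. The helper names are made up for illustration and are not kernel functions; the input values are copied from the logs.

```python
PAGE_SHIFT = 12                        # assumes 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1    # low 12 bits select a byte in the page


def split_machine_ptr(machineptr):
    """Return (frame number, byte offset within the page)."""
    return machineptr >> PAGE_SHIFT, machineptr & OFFSET_MASK


def page_offset(virtptr):
    """Return the byte offset of a virtual address within its page."""
    return virtptr & OFFSET_MASK


# (virtptr, machineptr) pairs copied from the two oopses.
samples = {
    "pae": (0xF57A70C0, 0x7FFFFFFF0C0),
    "non-pae": (0xFBFA7060, 0xFFFFF060),
}

for name, (virt, mach) in samples.items():
    mfn, off = split_machine_ptr(mach)
    print(f"{name}: mfn={mfn:x} offset={off:03x} virt offset={page_offset(virt):03x}")
```

In both cases the page offset of the machine pointer matches the offset of the virtual pointer, so only the frame number is bogus: it comes out as 7fffffff on pae and fffff on non-pae, the same values Xen prints as "mfn".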

Thanks,
-- 
Michael Marineau
Oregon State University
mike@marineau.org

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Keir Fraser

2007-Sep-15 07:07 UTC

Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel

On 14/9/07 23:51, "Michael Marineau" <mike@marineau.org> wrote:
> I have been unable to reproduce this with 3.0.4's 2.6.16 kernel, but
> 2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
> ok.
>
> I'm guessing this issue is the same as the oops reported here:
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975
>
> Below is an example of the oops on my 2.6.18 pae kernel with a couple
> extra debugging lines added:

Looks like xen_l1_entry_update() is passed a virtual address which has no
corresponding machine address. So the pte page or its mapping is corrupted
somehow. deadbeef in the register dumps is also not a good sign. I'll have a
go at repro'ing.
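
One hedged way to read the bogus frame numbers: 7fffffff is an all-ones 32-bit word with bit 31 cleared. The sketch below is a toy Python model, not kernel code. It assumes, without confirmation from this thread, that the kernel marks a page with no machine frame using an all-ones entry in its pfn-to-mfn table and masks off a top flag bit on lookup. The names INVALID_P2M_ENTRY and FOREIGN_FRAME_BIT and the table contents are assumptions chosen for illustration.

```python
PAGE_SHIFT = 12                   # assumes 4 KiB pages
WORD_MASK = 0xFFFFFFFF            # 32-bit unsigned long on x86_32
INVALID_P2M_ENTRY = 0xFFFFFFFF    # assumed marker: pfn has no machine frame
FOREIGN_FRAME_BIT = 1 << 31       # assumed flag bit, masked off on lookup


def pfn_to_mfn(p2m, pfn):
    """Look up a machine frame, clearing the assumed flag bit."""
    return p2m[pfn] & (WORD_MASK ^ FOREIGN_FRAME_BIT)


def machine_ptr(p2m, pfn, offset, addr_bits):
    """Build a machine address, truncated to the machine address width."""
    mfn = pfn_to_mfn(p2m, pfn)
    return ((mfn << PAGE_SHIFT) | offset) & ((1 << addr_bits) - 1)


# Toy table: pfn 0 is backed by a frame, pfn 1 has been given back.
p2m = [0x1234, INVALID_P2M_ENTRY]

print(hex(pfn_to_mfn(p2m, 1)))               # frame seen for the missing page
print(hex(machine_ptr(p2m, 1, 0x0C0, 64)))   # 64-bit machine address (pae)
print(hex(machine_ptr(p2m, 1, 0x060, 32)))   # 32-bit machine address (non-pae)
```

Under those assumptions the model gives 7fffffff0c0 for the pae case and fffff060 for the non-pae case, which are the machineptr values in the two oopses. That would fit a pte that still points at a pfn whose machine frame was returned to Xen, but the thread does not confirm this is the actual cause.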

 -- Keir




Michael Marineau

2007-Sep-17 23:56 UTC

Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel

On 9/15/07, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:
> On 14/9/07 23:51, "Michael Marineau" <mike@marineau.org> wrote:
>
> > I have been unable to reproduce this with 3.0.4's 2.6.16 kernel, but
> > 2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
> > ok.
> >
> > I'm guessing this issue is the same as the oops reported here:
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975
> >
> > Below is an example of the oops on my 2.6.18 pae kernel with a couple
> > extra debugging lines added:
>
> Looks like xen_l1_entry_update() is passed a virtual address which has no
> corresponding machine address. So the pte page or its mapping is corrupted
> somehow. deadbeef in the register dumps is also not a good sign. I'll have a
> go at repro'ing.
>
>  -- Keir
>

As for the deadbeef, I kind of doubt it is important. Those values
show up after the hypercall to Xen. Using the attached patch, which
checks for the bogus value prior to the call, I get the following oops:

virtptr: f57b40c0 machineptr: 7fffffff0c0
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:64!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU:    0
EIP:    0061:[<c0117893>]    Not tainted VLI
EFLAGS: 00010286   (2.6.18-xen-r5-try2 #10)
EIP is at xen_l1_entry_update+0xd7/0x100
eax: 0000002d   ebx: 00000000   ecx: 00000000   edx: 00000001
esi: fffff0c0   edi: 000007ff   ebp: ed45cd10   esp: ed45ccd8
ds: 007b   es: 007b   ss: 0069
Process bash (pid: 5044, ti=ed45c000 task=ec835a70 task.ti=ed45c000)
Stack: c037b964 f57b40c0 fffff0c0 000007ff 00000000 00000000 f57b40c0 fffff0c0
       000007ff 00000000 00000000 00000000 00000000 00000000 ed45cd84 c01586b7
       35371025 00000000 ecd95ec0 ecd95f08 c04bce70 00000000 00000004 00000000
Call Trace:
 [<c01586b7>] zap_pte_range+0x265/0x658
 [<c0158c16>] unmap_page_range+0x16c/0x2b4
 [<c0158e2c>] unmap_vmas+0xce/0x1cb
 [<c015f0b8>] exit_mmap+0x7d/0xf4
 [<c011e0f3>] mmput+0x36/0x8c
 [<c01782d3>] exec_mmap+0x156/0x229
 [<c0178a78>] flush_old_exec+0x59/0x25a
 [<c0198a18>] load_elf_binary+0x33c/0xc52
 [<c0178f2a>] search_binary_handler+0x89/0x23c
 [<c017922f>] do_execve+0x152/0x1be
 [<c010391c>] sys_execve+0x32/0x84
 [<c0104dfb>] syscall_call+0x7/0xb
 [<b7efd899>] 0xb7efd899
Code: b4 97 fe ff 85 c0 78 42 83 c4 2c 5b 5e 5f 5d c3 8b 45 e0 89 74
24 08 89 7c 24 0
EIP: [<c0117893>] xen_l1_entry_update+0xd7/0x100 SS:ESP 0069:ed45ccd8
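
A small decoding aid for the register dump above: on a pae kernel a machine address is 64 bits wide, so on x86_32 it has to be carried as two 32-bit halves. The Python sketch below assumes that esi and edi hold the low and high halves here. That is inferred only from the values lining up with the printed machineptr, not from the disassembly, and the helper name is made up for illustration.

```python
def combine_halves(low, high):
    """Join two 32-bit halves into one 64-bit value."""
    return (high << 32) | low


# Register values copied from the oops; the esi/edi roles are an assumption.
esi, edi = 0xFFFFF0C0, 0x000007FF

ptr = combine_halves(esi, edi)
print(f"{ptr:x}")
```

The result is the machineptr printed at the top of this oops. The same pair, fffff0c0 000007ff, also appears in the stack dump, and ebx no longer holds deadbeef, which fits the bad address already being in place before the hypercall is made.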

-- 
Michael Marineau
Oregon State University
mike@marineau.org



Michael Marineau

2007-Oct-03 20:39 UTC

Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel

On 9/17/07, Michael Marineau <mike@marineau.org> wrote:
> On 9/15/07, Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:
> > On 14/9/07 23:51, "Michael Marineau" <mike@marineau.org> wrote:
> >
> > > I have been unable to reproduce this with 3.0.4's 2.6.16 kernel, but
> > > 2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
> > > ok.
> > >
> > > I'm guessing this issue is the same as the oops reported here:
> > > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975
> > >
> > > Below is an example of the oops on my 2.6.18 pae kernel with a couple
> > > extra debugging lines added:
> >
> > Looks like xen_l1_entry_update() is passed a virtual address which has no
> > corresponding machine address. So the pte page or its mapping is corrupted
> > somehow. deadbeef in the register dumps is also not a good sign. I'll have a
> > go at repro'ing.
> >
> >  -- Keir
> >
>
> As for the deadbeef, I kind of doubt it is important. Those values
> show up after the hypercall to Xen. Using the attached patch, which
> checks for the bogus value prior to the call, I get the following oops:
>
> virtptr: f57b40c0 machineptr: 7fffffff0c0
> ------------[ cut here ]------------
> kernel BUG at arch/i386/mm/hypervisor.c:64!
> invalid opcode: 0000 [#1]
> SMP
> Modules linked in:
> CPU:    0
> EIP:    0061:[<c0117893>]    Not tainted VLI
> EFLAGS: 00010286   (2.6.18-xen-r5-try2 #10)
> EIP is at xen_l1_entry_update+0xd7/0x100
> eax: 0000002d   ebx: 00000000   ecx: 00000000   edx: 00000001
> esi: fffff0c0   edi: 000007ff   ebp: ed45cd10   esp: ed45ccd8
> ds: 007b   es: 007b   ss: 0069
> Process bash (pid: 5044, ti=ed45c000 task=ec835a70 task.ti=ed45c000)
> Stack: c037b964 f57b40c0 fffff0c0 000007ff 00000000 00000000 f57b40c0 fffff0c0
>        000007ff 00000000 00000000 00000000 00000000 00000000 ed45cd84 c01586b7
>        35371025 00000000 ecd95ec0 ecd95f08 c04bce70 00000000 00000004 00000000
> Call Trace:
>  [<c01586b7>] zap_pte_range+0x265/0x658
>  [<c0158c16>] unmap_page_range+0x16c/0x2b4
>  [<c0158e2c>] unmap_vmas+0xce/0x1cb
>  [<c015f0b8>] exit_mmap+0x7d/0xf4
>  [<c011e0f3>] mmput+0x36/0x8c
>  [<c01782d3>] exec_mmap+0x156/0x229
>  [<c0178a78>] flush_old_exec+0x59/0x25a
>  [<c0198a18>] load_elf_binary+0x33c/0xc52
>  [<c0178f2a>] search_binary_handler+0x89/0x23c
>  [<c017922f>] do_execve+0x152/0x1be
>  [<c010391c>] sys_execve+0x32/0x84
>  [<c0104dfb>] syscall_call+0x7/0xb
>  [<b7efd899>] 0xb7efd899
> Code: b4 97 fe ff 85 c0 78 42 83 c4 2c 5b 5e 5f 5d c3 8b 45 e0 89 74
> 24 08 89 7c 24 0
> EIP: [<c0117893>] xen_l1_entry_update+0xd7/0x100 SS:ESP 0069:ed45ccd8

I can still reproduce this problem on the 3.1.1-rc2 Xen kernel. Has
anyone had a chance to take a look at this or try to reproduce it? I
can reproduce this far too easily :-(

Is there any further debugging information I can provide?

-- 
Michael Marineau
Oregon State University
mike@marineau.org


Keir Fraser

2007-Oct-04 09:35 UTC

Re: [Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel

On 3/10/07 21:39, "Michael Marineau" <mike@marineau.org> wrote:
>> Code: b4 97 fe ff 85 c0 78 42 83 c4 2c 5b 5e 5f 5d c3 8b 45 e0 89 74
>> 24 08 89 7c 24 0
>> EIP: [<c0117893>] xen_l1_entry_update+0xd7/0x100 SS:ESP 0069:ed45ccd8
>
> I can still reproduce this problem on the 3.1.1-rc2 Xen kernel. Has
> anyone had a chance to take a look at this or try to reproduce it? I
> can reproduce this far too easily :-(

I'll need to find a test box with more than 2GB of memory...

 -- Keir


