Ian Pratt
2006-Oct-04 13:08 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
> When running on 4GB of total memory instead of 12GB,
> everything is just fine. (the three virtual machines, Dom0 +
> 2 x DomU are assigned 1GB of memory each, in both test runs).
> Does that help?

Is this with the kernel and xen from -unstable/3.0.3? Have you changed
the config? What storage device do you have? What NIC?

Are you setting mem=4096M on the Xen command line? If you removed DIMMs
to get 4GB in the machine some of the memory will still be mapped above
4GB.

It seems hard to imagine this is a lurking 4GB issue (especially on
x86_64 rather than PAE).

Ian

> If you have any ideas where I should do more debugging,
> please tell me.
> We would really like to get this machine going.
>
> > Oct 3 23:27:28 tuek BUG: soft lockup detected on CPU#0!
> > Oct 3 23:27:28 tuek CPU 0:
> > Oct 3 23:27:28 tuek Modules linked in: nfsd exportfs
> > Oct 3 23:27:28 tuek Pid: 3988, comm: gmetad Not tainted 2.6.16.29-xen-xenU #2
> > Oct 3 23:27:28 tuek RIP: e030:[<ffffffff8010722a>] <ffffffff8010722a>{hypercall_page+554}
> > Oct 3 23:27:28 tuek RSP: e02b:ffff88003e32f9e0 EFLAGS: 00000246
> > Oct 3 23:27:28 tuek RAX: 0000000000030000 RBX: ffff8800017ea448 RCX: ffffffff8010722a
> > Oct 3 23:27:28 tuek RDX: ffffffffff5fd000 RSI: 0000000000000000 RDI: 0000000000000000
> > Oct 3 23:27:28 tuek RBP: ffff88003e32f9f8 R08: 0000000000000000 R09: 0000000000000000
> > Oct 3 23:27:28 tuek R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000001000
> > Oct 3 23:27:28 tuek R13: ffff88003e32fd38 R14: 0000000000005000 R15: 0000000000000002
> > Oct 3 23:27:28 tuek FS: 00002aeaaa684b00(0000) GS:ffffffff804bf000(0000) knlGS:0000000000000000
> > Oct 3 23:27:28 tuek CS: e033 DS: 0000 ES: 0000
> > Oct 3 23:27:28 tuek
> > Oct 3 23:27:28 tuek Call Trace: <ffffffff802dc47e>{force_evtchn_callback+14}
> > Oct 3 23:27:28 tuek <ffffffff803d4ab6>{do_page_fault+214} <ffffffff8010b6fb>{error_exit+0}
> > Oct 3 23:27:28 tuek <ffffffff8010b6fb>{error_exit+0} <ffffffff8014f50e>{file_read_actor+62}
> > Oct 3 23:27:28 tuek <ffffffff8014f57c>{file_read_actor+172} <ffffffff8014d19c>{do_generic_mapping_read+412}
> > Oct 3 23:27:28 tuek <ffffffff8014f4d0>{file_read_actor+0} <ffffffff8014dce8>{__generic_file_aio_read+424}
> > Oct 3 23:27:28 tuek <ffffffff8014dd98>{generic_file_aio_read+56} <ffffffff801f8f51>{nfs_file_read+129}
> > Oct 3 23:27:28 tuek <ffffffff80172dd0>{do_sync_read+240} <ffffffff80161981>{vma_link+129}
> > Oct 3 23:27:28 tuek <ffffffff80140500>{autoremove_wake_function+0} <ffffffff80162b02>{do_mmap_pgoff+1458}
> > Oct 3 23:27:28 tuek <ffffffff8017381b>{vfs_read+187} <ffffffff80173ce0>{sys_read+80}
> > Oct 3 23:27:28 tuek <ffffffff8010afbe>{system_call+134} <ffffffff8010af38>{system_call+0}
> >
> > Oct 3 23:27:52 tuek Bad page state in process 'bash'
> > Oct 3 23:27:52 tuek page:ffff880001c72bc8 flags:0x0000000000000000 mapping:0000000000000000 mapcount:1 count:1
> > Oct 3 23:27:52 tuek Trying to fix it up, but a reboot is needed
> > Oct 3 23:27:52 tuek Backtrace:
> > Oct 3 23:27:52 tuek
> > Oct 3 23:27:52 tuek Call Trace: <ffffffff801512ad>{bad_page+93} <ffffffff80151d57>{get_page_from_freelist+775}
> > Oct 3 23:27:52 tuek <ffffffff80151f1d>{__alloc_pages+157} <ffffffff80152249>{get_zeroed_page+73}
> > Oct 3 23:27:52 tuek <ffffffff80158cf4>{__pmd_alloc+36} <ffffffff8015e55e>{copy_page_range+1262}
> > Oct 3 23:27:52 tuek <ffffffff802a6bea>{rb_insert_color+250} <ffffffff80127cb7>{copy_process+3079}
> > Oct 3 23:27:52 tuek <ffffffff80128c8e>{do_fork+238} <ffffffff801710d6>{fd_install+54}
> > Oct 3 23:27:52 tuek <ffffffff80134e8c>{sigprocmask+220} <ffffffff8010afbe>{system_call+134}
> > Oct 3 23:27:52 tuek <ffffffff801094b3>{sys_clone+35} <ffffffff8010b3e9>{ptregscall_common+61}
> >
> > Oct 3 23:27:52 tuek ----------- [cut here ] --------- [please bite here ] ---------
> > Oct 3 23:27:52 tuek Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:198
> > Oct 3 23:27:52 tuek invalid opcode: 0000 [1] SMP
> > Oct 3 23:27:52 tuek CPU 3
> > Oct 3 23:27:52 tuek Modules linked in: nfsd exportfs
> > Oct 3 23:27:52 tuek Pid: 4617, comm: bash Tainted: G B 2.6.16.29-xen-xenU #2
> > Oct 3 23:27:52 tuek RIP: e030:[<ffffffff80117cb5>] <ffffffff80117cb5>{xen_pgd_pin+85}
> > Oct 3 23:27:52 tuek RSP: e02b:ffff880038ed9d58 EFLAGS: 00010282
> > Oct 3 23:27:52 tuek RAX: 00000000ffffffea RBX: ffff880000e098c0 RCX: 000000000001dc48
> > Oct 3 23:27:52 tuek RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880038ed9d58
> > Oct 3 23:27:52 tuek RBP: ffff880038ed9d78 R08: ffff880038e7fff8 R09: ffff880038e7fff8
> > Oct 3 23:27:52 tuek R10: 0000000000007ff0 R11: ffff880002d39008 R12: 0000000000000000
> > Oct 3 23:27:52 tuek R13: ffff8800006383c0 R14: 0000000001200011 R15: ffff8800006383c0
> > Oct 3 23:27:52 tuek FS: 00002afecc63ae60(0000) GS:ffffffff804bf180(0000) knlGS:0000000000000000
> > Oct 3 23:27:52 tuek CS: e033 DS: 0000 ES: 0000
> > Oct 3 23:27:52 tuek Process bash (pid: 4617, threadinfo ffff880038ed8000, task ffff88003f9e0180)
> > Oct 3 23:27:52 tuek Stack: 0000000000000003 00000000001b3aa7 0000000001200011 ffff880002d39008
> > Oct 3 23:27:52 tuek ffff880038ed9d98 ffffffff80117543 0000000000000000 ffff88003ca4ea28
> > Oct 3 23:27:52 tuek ffff880038ed9da8 ffffffff801175f2
> > Oct 3 23:27:52 tuek Call Trace: <ffffffff80117543>{mm_pin+387} <ffffffff801175f2>{_arch_dup_mmap+18}
> > Oct 3 23:27:52 tuek <ffffffff80127cf6>{copy_process+3142} <ffffffff80128c8e>{do_fork+238}
> > Oct 3 23:27:52 tuek <ffffffff801710d6>{fd_install+54} <ffffffff80134e8c>{sigprocmask+220}
> > Oct 3 23:27:52 tuek <ffffffff8010afbe>{system_call+134} <ffffffff801094b3>{sys_clone+35}
> > Oct 3 23:27:52 tuek <ffffffff8010b3e9>{ptregscall_common+61}
> > Oct 3 23:27:52 tuek
> > Oct 3 23:27:52 tuek Code: 0f 0b 68 38 d7 3f 80 c2 c6 00 90 c9 c3 0f 1f 80 00 00 00 00
> > Oct 3 23:27:52 tuek RIP <ffffffff80117cb5>{xen_pgd_pin+85} RSP <ffff880038ed9d58>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Christophe Saout
2006-Oct-04 13:30 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
On Wednesday, 2006-10-04 at 14:08 +0100, Ian Pratt wrote:

> > When running on 4GB of total memory instead of 12GB,
> > everything is just fine. (the three virtual machines, Dom0 +
> > 2 x DomU are assigned 1GB of memory each, in both test runs).
> > Does that help?
>
> Is this with the kernel and xen from -unstable/3.0.3?

Yes.

> Have you changed the config?

The XEN and CPU config options are identical to the configs that come as
defaults. Just a lot of device drivers are not compiled in.

> What storage device do you have? What NIC?

The hard disks are attached to two 3Ware/AMCC SATA storage controllers
(9550SXU-8L), the NIC is an Intel PRO/1000. When crashing the system, I
am not involving the NIC, just traffic on the internal bridge.

> Are you setting mem=4096M on the Xen command line? If you removed DIMMs
> to get 4GB in the machine some of the memory will still be mapped above
> 4GB.

Yes, I removed the DIMMs. I'm just testing with 8GB. Should I try
limiting the memory with mem= as well?

> It seems hard to imagine this is a lurking 4GB issue (especially on
> x86_64 rather than PAE).

Yes, possibly. We will also test some BIOS options related to memory.
I'll give you feedback if we figure something out.
Christophe Saout
2006-Oct-04 21:15 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
Hello Ian,

> > When running on 4GB of total memory instead of 12GB,
> > everything is just fine. (the three virtual machines, Dom0 +
> > 2 x DomU are assigned 1GB of memory each, in both test runs).
> > Does that help?
>
> Are you setting mem=4096M on the Xen command line? If you removed DIMMs
> to get 4GB in the machine some of the memory will still be mapped above
> 4GB.
>
> It seems hard to imagine this is a lurking 4GB issue (especially on
> x86_64 rather than PAE).

The good news is that we were able to fix this problem by changing the
BIOS settings concerning "memory holes". There were two settings,
"hardware memory hole" and "software memory hole", that "enable
software/hardware remapping around memory hole", whatever that is. They
were both turned on by default, and I just turned them off. I didn't see
any downsides, except that I'm unable to crash the machine any more.
It's been surviving my stress tests for several hours now without
crashes.

The BIOS help also says that the "hardware memory hole" only works on
REV E0 processors, so perhaps this configures some weird mapping that
Xen doesn't understand? Anyway, I'll stick with this setting now, given
that it just works.

Sorry for all the confusion.
Ian Pratt
2006-Oct-05 04:02 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
> The good news is that we were able to fix this problem by changing
> BIOS settings concerning "memory holes". There were two settings,
> "hardware memory hole" and "software memory hole", that "enable
> software/hardware remapping around memory hole", whatever that is.
> They were both turned on by default, and I just turned them off. I
> didn't see any downsides except that I'm unable to crash the machine
> any more. It's surviving my stress tests for several hours now without
> crashes.
>
> The BIOS help also says that the "hardware memory hole" only works on
> REV E0 processors, so perhaps this configures some weird mapping that
> Xen doesn't understand? Anyway, I'll stick with this setting now, given
> that it just works.

Glad it works for you, but I wish we understood what was going on a bit
more. It may be that the BIOS is just borked and the e820 map it gives
Xen misses some regions that it steals for other purposes. It would be
pretty surprising if Xen had bugs in its e820 code.

It might be interesting to post the xm dmesg output with the two
different BIOS settings to see if there's anything unusual about the
e820 map. Might be worth comparing against what Linux prints too.

Ian
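[Editorial note: for anyone reproducing this comparison, a small script can pull the e820 lines out of `xm dmesg` and total the usable regions. This is only a sketch; it assumes the `(XEN) <start> - <end> (<type>)` line format that Xen prints at boot, as in the maps quoted later in this thread.]

```python
import re

# Matches Xen's boot-time e820 lines, e.g.
#   (XEN)  0000000000100000 - 00000000bfff0000 (usable)
E820_RE = re.compile(r"\(XEN\)\s+([0-9a-f]{16}) - ([0-9a-f]{16}) \((\w[\w ]*)\)")

def parse_e820(dmesg_text):
    """Return (start, end, type) tuples from `xm dmesg` output."""
    return [(int(s, 16), int(e, 16), t)
            for s, e, t in E820_RE.findall(dmesg_text)]

def usable_mb(regions):
    """Megabytes the firmware reports as usable RAM."""
    return sum(end - start for start, end, t in regions
               if t == "usable") / (1024.0 * 1024.0)
```

Running both captures through `parse_e820` and diffing the resulting tuples makes any region that shrank or disappeared between the two BIOS settings stand out immediately.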
Keir Fraser
2006-Oct-05 07:42 UTC
Re: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
On 5/10/06 5:02 am, "Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote:

>> The BIOS help also says that the "hardware memory hole" only works on
>> REV E0 processors, so perhaps this configures some weird mapping that
>> Xen doesn't understand? Anyway, I'll stick with this setting now, given
>> that it just works.
>
> Glad it works for you, but I wish we understood what was going on a bit
> more. It may be that the BIOS is just borked and the e820 map it gives
> Xen misses some regions that it steals for other purposes. It would be
> pretty surprising if Xen had bugs in its e820 code.
>
> It might be interesting to post the xm dmesg output with the two
> different BIOS settings to see if there's anything unusual about the
> e820 map. Might be worth comparing against what Linux prints too.

Older Opterons couldn't remap DRAM around the I/O memory region. That
limitation went away some time ago, though, and I expect dual-core chips
should all have a memory controller that supports DRAM remapping. The
issue here is more likely that the BIOS is a basket case. You might try
upgrading the BIOS and see if that helps.

The downside of the software memory hole appears to be that remapped RAM
accesses are apparently 'emulated', which doesn't sound fast! And if you
specify no hole at all, you will lose around 512MB of memory.

Look around on Google for complaints about "hardware memory hole"
causing problems for people. There are plenty! You seem unlucky that
your issue is as hard to reproduce as it is. Many people can't even
boot.

 -- Keir
Christophe Saout
2006-Oct-05 14:18 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
On Thursday, 2006-10-05 at 05:02 +0100, Ian Pratt wrote:

> Glad it works for you, but I wish we understood what was going on a bit
> more. It may be that the BIOS is just borked and the e820 map it gives
> Xen misses some regions that it steals for other purposes. It would be
> pretty surprising if Xen had bugs in its e820 code.
>
> It might be interesting to post the xm dmesg output with the two
> different BIOS settings to see if there's anything unusual about the
> e820 map.

The only difference is in the Physical RAM map.

Broken (with memory hole remapping turned on):

(XEN) Physical RAM map:
(XEN) 0000000000000000 - 000000000009fc00 (usable)
(XEN) 000000000009fc00 - 00000000000a0000 (reserved)
(XEN) 00000000000e8000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 00000000bfff0000 (usable)
(XEN) 00000000bfff0000 - 00000000bffff000 (ACPI data)
(XEN) 00000000bffff000 - 00000000c0000000 (ACPI NVS)
(XEN) 00000000ff780000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 000000030e000000 (usable)
(XEN) System RAM: 11487MB (11763260kB)

Working (with memory hole remapping turned off):

(XEN) Physical RAM map:
(XEN) 0000000000000000 - 000000000009fc00 (usable)
(XEN) 000000000009fc00 - 00000000000a0000 (reserved)
(XEN) 00000000000e8000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 00000000efff0000 (usable)
(XEN) 00000000efff0000 - 00000000effff000 (ACPI data)
(XEN) 00000000effff000 - 00000000f0000000 (ACPI NVS)
(XEN) 00000000ff780000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 0000000300000000 (usable)
(XEN) System RAM: 12031MB (12320316kB)

The strange thing is that the first configuration shows even less memory
than the second one, which has only 256MB missing?

> Might be worth comparing against what Linux prints too.

Ok, I'll try to boot the Dom0 without the hypervisor to get some numbers
from native Linux for comparison.

As Keir suggested, this really might be a BIOS bug. Our hardware vendor
has notified the motherboard manufacturer to have this checked (this was
already the latest version).
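[Editorial note: plain arithmetic on the usable ranges in the two maps above quantifies the difference. The ranges below are copied from the e820 dumps in this message; the truncated-to-MB totals happen to match Xen's "System RAM" lines.]

```python
# Usable (start, end) ranges copied from the two e820 maps above.
broken  = [(0x0, 0x9fc00), (0x100000, 0xbfff0000), (0x100000000, 0x30e000000)]
working = [(0x0, 0x9fc00), (0x100000, 0xefff0000), (0x100000000, 0x300000000)]

def usable_mb(regions):
    # Total usable bytes, truncated to whole megabytes.
    return sum(end - start for start, end in regions) // (1024 * 1024)

print(usable_mb(broken))   # 11487, matching "System RAM: 11487MB"
print(usable_mb(working))  # 12031, matching "System RAM: 12031MB"
print(usable_mb(working) - usable_mb(broken))  # 544
```

So the remapping-on configuration exposes roughly 544MB less usable RAM than the remapping-off one, i.e. noticeably more than the ~256MB that the working map itself is missing.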
Petersson, Mats
2006-Oct-05 16:38 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of
> Christophe Saout
> Sent: 05 October 2006 15:19
> To: Ian Pratt
> Cc: xen-devel@lists.xensource.com
> Subject: RE: [Xen-devel] Kernel BUG at
> arch/x86_64/mm/../../i386/mm/hypervisor.c:197
>
> On Thursday, 2006-10-05 at 05:02 +0100, Ian Pratt wrote:
>
> > Glad it works for you, but I wish we understood what was going on a
> > bit more. It may be that the BIOS is just borked and the e820 map it
> > gives Xen misses some regions that it steals for other purposes. It
> > would be pretty surprising if Xen had bugs in its e820 code.
> >
> > It might be interesting to post the xm dmesg output with the two
> > different BIOS settings to see if there's anything unusual about the
> > e820 map.
>
> The only difference is in the Physical RAM map:
>
> Broken (with memory hole remapping turned on):
>
> (XEN) Physical RAM map:
> (XEN) 0000000000000000 - 000000000009fc00 (usable)
> (XEN) 000000000009fc00 - 00000000000a0000 (reserved)
> (XEN) 00000000000e8000 - 0000000000100000 (reserved)
> (XEN) 0000000000100000 - 00000000bfff0000 (usable)
> (XEN) 00000000bfff0000 - 00000000bffff000 (ACPI data)
> (XEN) 00000000bffff000 - 00000000c0000000 (ACPI NVS)

There is a HOLE here - c0000000 to ff780000 is "missing". That's 1GB
minus a little bit.

> (XEN) 00000000ff780000 - 0000000100000000 (reserved)
> (XEN) 0000000100000000 - 000000030e000000 (usable)
> (XEN) System RAM: 11487MB (11763260kB)
>
> Working (with memory hole remapping turned off):
>
> (XEN) Physical RAM map:
> (XEN) 0000000000000000 - 000000000009fc00 (usable)
> (XEN) 000000000009fc00 - 00000000000a0000 (reserved)
> (XEN) 00000000000e8000 - 0000000000100000 (reserved)
> (XEN) 0000000000100000 - 00000000efff0000 (usable)

This area is bigger, which probably explains the extra usable memory.

> (XEN) 00000000efff0000 - 00000000effff000 (ACPI data)
> (XEN) 00000000effff000 - 00000000f0000000 (ACPI NVS)

The hole here is much smaller - only f0000000 to ff780000, around 256MB,
if my mental arithmetic isn't playing up (which it does quite
frequently).

> (XEN) 00000000ff780000 - 0000000100000000 (reserved)
> (XEN) 0000000100000000 - 0000000300000000 (usable)
> (XEN) System RAM: 12031MB (12320316kB)
>
> The strange thing is that the first configuration shows even less
> memory than the second one, which has only 256MB missing?

That may be explained by the above comments - but I can't explain what's
going wrong in Xen with this...

-- Mats

> > Might be worth comparing against what Linux prints too.
>
> Ok, I'll try to boot the Dom0 without the hypervisor to get some
> numbers from native Linux for comparison.
>
> As Keir suggested, this really might be a BIOS bug. Our hardware
> vendor has notified the motherboard manufacturer to have this checked
> (this already was the latest version).
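[Editorial note: Mats's mental arithmetic checks out; the hole sizes fall straight out of the e820 boundaries quoted above.]

```python
# Gap between the end of the ACPI NVS region and the start of the
# reserved region just below 4GB, for each BIOS setting.
hole_remap_on  = 0xff780000 - 0xc0000000  # "broken" map
hole_remap_off = 0xff780000 - 0xf0000000  # "working" map

print(hole_remap_on / (1024.0 ** 2))   # 1015.5 -> "1GB minus a little bit"
print(hole_remap_off / (1024.0 ** 2))  # 247.5  -> "around 256MB"
```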