Ian Pratt
2006-Oct-04 13:08 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
> When running on 4GB of total memory instead of 12GB,
> everything is just fine. (the three virtual machines, Dom0 +
> 2 x DomU are assigned 1GB of memory each, in both test runs).
> Does that help?

Is this with the kernel and xen from -unstable/3.0.3? Have you changed
the config? What storage device do you have? What NIC?

Are you setting mem=4096M on the Xen command line? If you removed DIMMs
to get 4GB in the machine some of the memory will still be mapped above
4GB.

It seems hard to imagine this is a lurking 4GB issue (especially on
x86_64 rather than PAE).

Ian

> If you have any ideas where I should do more debugging,
> please tell me.
> We would really like to get this machine going.
>
> > Oct 3 23:27:28 tuek BUG: soft lockup detected on CPU#0!
> > Oct 3 23:27:28 tuek CPU 0:
> > Oct 3 23:27:28 tuek Modules linked in: nfsd exportfs
> > Oct 3 23:27:28 tuek Pid: 3988, comm: gmetad Not tainted 2.6.16.29-xen-xenU #2
> > Oct 3 23:27:28 tuek RIP: e030:[<ffffffff8010722a>] <ffffffff8010722a>{hypercall_page+554}
> > Oct 3 23:27:28 tuek RSP: e02b:ffff88003e32f9e0 EFLAGS: 00000246
> > Oct 3 23:27:28 tuek RAX: 0000000000030000 RBX: ffff8800017ea448 RCX: ffffffff8010722a
> > Oct 3 23:27:28 tuek RDX: ffffffffff5fd000 RSI: 0000000000000000 RDI: 0000000000000000
> > Oct 3 23:27:28 tuek RBP: ffff88003e32f9f8 R08: 0000000000000000 R09: 0000000000000000
> > Oct 3 23:27:28 tuek R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000001000
> > Oct 3 23:27:28 tuek R13: ffff88003e32fd38 R14: 0000000000005000 R15: 0000000000000002
> > Oct 3 23:27:28 tuek FS: 00002aeaaa684b00(0000) GS:ffffffff804bf000(0000) knlGS:0000000000000000
> > Oct 3 23:27:28 tuek CS: e033 DS: 0000 ES: 0000
> > Oct 3 23:27:28 tuek
> > Oct 3 23:27:28 tuek Call Trace: <ffffffff802dc47e>{force_evtchn_callback+14}
> > Oct 3 23:27:28 tuek <ffffffff803d4ab6>{do_page_fault+214} <ffffffff8010b6fb>{error_exit+0}
> > Oct 3 23:27:28 tuek <ffffffff8010b6fb>{error_exit+0} <ffffffff8014f50e>{file_read_actor+62}
> > Oct 3 23:27:28 tuek <ffffffff8014f57c>{file_read_actor+172} <ffffffff8014d19c>{do_generic_mapping_read+412}
> > Oct 3 23:27:28 tuek <ffffffff8014f4d0>{file_read_actor+0} <ffffffff8014dce8>{__generic_file_aio_read+424}
> > Oct 3 23:27:28 tuek <ffffffff8014dd98>{generic_file_aio_read+56} <ffffffff801f8f51>{nfs_file_read+129}
> > Oct 3 23:27:28 tuek <ffffffff80172dd0>{do_sync_read+240} <ffffffff80161981>{vma_link+129}
> > Oct 3 23:27:28 tuek <ffffffff80140500>{autoremove_wake_function+0} <ffffffff80162b02>{do_mmap_pgoff+1458}
> > Oct 3 23:27:28 tuek <ffffffff8017381b>{vfs_read+187} <ffffffff80173ce0>{sys_read+80}
> > Oct 3 23:27:28 tuek <ffffffff8010afbe>{system_call+134} <ffffffff8010af38>{system_call+0}
> >
> > Oct 3 23:27:52 tuek Bad page state in process 'bash'
> > Oct 3 23:27:52 tuek page:ffff880001c72bc8 flags:0x0000000000000000 mapping:0000000000000000 mapcount:1 count:1
> > Oct 3 23:27:52 tuek Trying to fix it up, but a reboot is needed
> > Oct 3 23:27:52 tuek Backtrace:
> > Oct 3 23:27:52 tuek
> > Oct 3 23:27:52 tuek Call Trace: <ffffffff801512ad>{bad_page+93} <ffffffff80151d57>{get_page_from_freelist+775}
> > Oct 3 23:27:52 tuek <ffffffff80151f1d>{__alloc_pages+157} <ffffffff80152249>{get_zeroed_page+73}
> > Oct 3 23:27:52 tuek <ffffffff80158cf4>{__pmd_alloc+36} <ffffffff8015e55e>{copy_page_range+1262}
> > Oct 3 23:27:52 tuek <ffffffff802a6bea>{rb_insert_color+250} <ffffffff80127cb7>{copy_process+3079}
> > Oct 3 23:27:52 tuek <ffffffff80128c8e>{do_fork+238} <ffffffff801710d6>{fd_install+54}
> > Oct 3 23:27:52 tuek <ffffffff80134e8c>{sigprocmask+220} <ffffffff8010afbe>{system_call+134}
> > Oct 3 23:27:52 tuek <ffffffff801094b3>{sys_clone+35} <ffffffff8010b3e9>{ptregscall_common+61}
> >
> > Oct 3 23:27:52 tuek ----------- [cut here ] --------- [please bite here ] ---------
> > Oct 3 23:27:52 tuek Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:198
> > Oct 3 23:27:52 tuek invalid opcode: 0000 [1] SMP
> > Oct 3 23:27:52 tuek CPU 3
> > Oct 3 23:27:52 tuek Modules linked in: nfsd exportfs
> > Oct 3 23:27:52 tuek Pid: 4617, comm: bash Tainted: G B 2.6.16.29-xen-xenU #2
> > Oct 3 23:27:52 tuek RIP: e030:[<ffffffff80117cb5>] <ffffffff80117cb5>{xen_pgd_pin+85}
> > Oct 3 23:27:52 tuek RSP: e02b:ffff880038ed9d58 EFLAGS: 00010282
> > Oct 3 23:27:52 tuek RAX: 00000000ffffffea RBX: ffff880000e098c0 RCX: 000000000001dc48
> > Oct 3 23:27:52 tuek RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880038ed9d58
> > Oct 3 23:27:52 tuek RBP: ffff880038ed9d78 R08: ffff880038e7fff8 R09: ffff880038e7fff8
> > Oct 3 23:27:52 tuek R10: 0000000000007ff0 R11: ffff880002d39008 R12: 0000000000000000
> > Oct 3 23:27:52 tuek R13: ffff8800006383c0 R14: 0000000001200011 R15: ffff8800006383c0
> > Oct 3 23:27:52 tuek FS: 00002afecc63ae60(0000) GS:ffffffff804bf180(0000) knlGS:0000000000000000
> > Oct 3 23:27:52 tuek CS: e033 DS: 0000 ES: 0000
> > Oct 3 23:27:52 tuek Process bash (pid: 4617, threadinfo ffff880038ed8000, task ffff88003f9e0180)
> > Oct 3 23:27:52 tuek Stack: 0000000000000003 00000000001b3aa7 0000000001200011 ffff880002d39008
> > Oct 3 23:27:52 tuek ffff880038ed9d98 ffffffff80117543 0000000000000000 ffff88003ca4ea28
> > Oct 3 23:27:52 tuek ffff880038ed9da8 ffffffff801175f2
> > Oct 3 23:27:52 tuek Call Trace: <ffffffff80117543>{mm_pin+387} <ffffffff801175f2>{_arch_dup_mmap+18}
> > Oct 3 23:27:52 tuek <ffffffff80127cf6>{copy_process+3142} <ffffffff80128c8e>{do_fork+238}
> > Oct 3 23:27:52 tuek <ffffffff801710d6>{fd_install+54} <ffffffff80134e8c>{sigprocmask+220}
> > Oct 3 23:27:52 tuek <ffffffff8010afbe>{system_call+134} <ffffffff801094b3>{sys_clone+35}
> > Oct 3 23:27:52 tuek <ffffffff8010b3e9>{ptregscall_common+61}
> > Oct 3 23:27:52 tuek
> > Oct 3 23:27:52 tuek Code: 0f 0b 68 38 d7 3f 80 c2 c6 00 90 c9 c3 0f 1f 80 00 00 00 00
> > Oct 3 23:27:52 tuek RIP <ffffffff80117cb5>{xen_pgd_pin+85} RSP <ffff880038ed9d58>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Christophe Saout
2006-Oct-04 13:30 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
On Wednesday, 2006-10-04 at 14:08 +0100, Ian Pratt wrote:

> > When running on 4GB of total memory instead of 12GB,
> > everything is just fine. (the three virtual machines, Dom0 +
> > 2 x DomU are assigned 1GB of memory each, in both test runs).
> > Does that help?
>
> Is this with the kernel and xen from -unstable/3.0.3?

Yes.

> Have you changed the config?

The XEN and CPU config options are identical to the configs that come as
defaults. Just a lot of device drivers are not compiled in.

> What storage device do you have? What NIC?

The hard disks are attached to two 3Ware/AMCC SATA storage controllers
(9550SXU-8L), the NIC is an Intel PRO/1000. When crashing the system, I
am not involving the NIC, just traffic on the internal bridge.

> Are you setting mem=4096M on the Xen command line? If you removed DIMMs
> to get 4GB in the machine some of the memory will still be mapped above
> 4GB.

Yes, I removed the DIMMs. I'm just testing with 8GB. Should I try
limiting the memory with mem= as well?

> It seems hard to imagine this is a lurking 4GB issue (especially on
> x86_64 rather than PAE).

Yes, possibly. We will also test some BIOS options related to memory.
I'll give you feedback if we figure something out.
Christophe Saout
2006-Oct-04 21:15 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
Hello Ian,

> > When running on 4GB of total memory instead of 12GB,
> > everything is just fine. (the three virtual machines, Dom0 +
> > 2 x DomU are assigned 1GB of memory each, in both test runs).
> > Does that help?
>
> Are you setting mem=4096M on the Xen command line? If you removed DIMMs
> to get 4GB in the machine some of the memory will still be mapped above
> 4GB.
>
> It seems hard to imagine this is a lurking 4GB issue (especially on
> x86_64 rather than PAE).

The good news is that we were able to fix this problem by changing the
BIOS settings concerning "memory holes". There were two settings,
"hardware memory hole" and "software memory hole", that "enable
software/hardware remapping around memory hole", whatever that is. They
were both turned on by default, and I just turned them off. I didn't see
any downsides, except that I'm unable to crash the machine any more.
It's been surviving my stress tests for several hours now without
crashes.

The BIOS help also says that the "hardware memory hole" only works on
REV E0 processors, so perhaps this configures some weird mapping that
Xen doesn't understand? Anyway, I'll stick with this setting now, given
that it just works.

Sorry for all the confusion.
Ian Pratt
2006-Oct-05 04:02 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
> The good news is that we were able to fix this problem by changing
> BIOS settings concerning "memory holes". There were two settings,
> "hardware memory hole" and "software memory hole", that "enable
> software/hardware remapping around memory hole", whatever that is.
> They were both turned on by default, and I just turned them off. I
> didn't see any downsides except that I'm unable to crash the machine
> any more. It's surviving my stress tests for several hours now without
> crashes.
>
> The BIOS help also says that the "hardware memory hole" only works on
> REV E0 processors, so perhaps this configures some weird mapping that
> Xen doesn't understand? Anyway, I'll stick with this setting now, given
> that it just works.

Glad it works for you, but I wish we understood what was going on a bit
more. It may be that the BIOS is just borked and the e820 map it gives
Xen misses some regions that it steals for other purposes. It would be
pretty surprising if Xen had bugs in its e820 code.

It might be interesting to post the xm dmesg output with the two
different BIOS settings to see if there's anything unusual about the
e820 map. Might be worth comparing against what Linux prints too.

Ian
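[Editorial note: for anyone reproducing this comparison, a small script can pull the e820 lines out of `xm dmesg` and total the usable regions. This is only a sketch; it assumes the `(XEN) <start> - <end> (<type>)` line format that Xen prints at boot, as in the maps quoted later in this thread.]

```python
import re

# Matches Xen's boot-time e820 lines, e.g.
#   (XEN)  0000000000100000 - 00000000bfff0000 (usable)
E820_RE = re.compile(r"\(XEN\)\s+([0-9a-f]{16}) - ([0-9a-f]{16}) \((\w[\w ]*)\)")

def parse_e820(dmesg_text):
    """Return (start, end, type) tuples from `xm dmesg` output."""
    return [(int(s, 16), int(e, 16), t)
            for s, e, t in E820_RE.findall(dmesg_text)]

def usable_mb(regions):
    """Megabytes the firmware reports as usable RAM."""
    return sum(end - start for start, end, t in regions
               if t == "usable") / (1024.0 * 1024.0)
```

Running both captures through `parse_e820` and diffing the resulting tuples makes any region that shrank or disappeared between the two BIOS settings stand out immediately.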
Keir Fraser
2006-Oct-05 07:42 UTC
Re: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
On 5/10/06 5:02 am, "Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote:

>> The BIOS help also says that the "hardware memory hole" only works on
>> REV E0 processors, so perhaps this configures some weird mapping that
>> Xen doesn't understand? Anyway, I'll stick with this setting now, given
>> that it just works.
>
> Glad it works for you, but I wish we understood what was going on a bit
> more. It may be that the BIOS is just borked and the e820 map it gives
> Xen misses some regions that it steals for other purposes. It would be
> pretty surprising if Xen had bugs in its e820 code.
>
> It might be interesting to post the xm dmesg output with the two
> different BIOS settings to see if there's anything unusual about the
> e820 map. Might be worth comparing against what Linux prints too.

Older Opterons couldn't remap DRAM around the I/O memory region. That
limitation went away some time ago, though, and I expect dual-core chips
should all have a memory controller that supports DRAM remapping. The
issue here is more likely that the BIOS is a basket case. You might try
upgrading the BIOS and see if that helps.

The downside of the software memory hole appears to be that remapped RAM
accesses are apparently 'emulated', which doesn't sound fast! And if you
specify no hole at all, you will lose around 512MB of memory.

Look around on Google for complaints about "hardware memory hole"
causing problems for people. There are plenty! You seem unlucky that
your issue is as hard to reproduce as it is. Many people can't even
boot.

 -- Keir
Christophe Saout
2006-Oct-05 14:18 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
On Thursday, 2006-10-05 at 05:02 +0100, Ian Pratt wrote:

> Glad it works for you, but I wish we understood what was going on a bit
> more. It may be that the BIOS is just borked and the e820 map it gives
> Xen misses some regions that it steals for other purposes. It would be
> pretty surprising if Xen had bugs in its e820 code.
>
> It might be interesting to post the xm dmesg output with the two
> different BIOS settings to see if there's anything unusual about the
> e820 map.

The only difference is in the Physical RAM map.

Broken (with memory hole remapping turned on):

(XEN) Physical RAM map:
(XEN) 0000000000000000 - 000000000009fc00 (usable)
(XEN) 000000000009fc00 - 00000000000a0000 (reserved)
(XEN) 00000000000e8000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 00000000bfff0000 (usable)
(XEN) 00000000bfff0000 - 00000000bffff000 (ACPI data)
(XEN) 00000000bffff000 - 00000000c0000000 (ACPI NVS)
(XEN) 00000000ff780000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 000000030e000000 (usable)
(XEN) System RAM: 11487MB (11763260kB)

Working (with memory hole remapping turned off):

(XEN) Physical RAM map:
(XEN) 0000000000000000 - 000000000009fc00 (usable)
(XEN) 000000000009fc00 - 00000000000a0000 (reserved)
(XEN) 00000000000e8000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 00000000efff0000 (usable)
(XEN) 00000000efff0000 - 00000000effff000 (ACPI data)
(XEN) 00000000effff000 - 00000000f0000000 (ACPI NVS)
(XEN) 00000000ff780000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 0000000300000000 (usable)
(XEN) System RAM: 12031MB (12320316kB)

The strange thing is that the first configuration shows even less memory
than the second one, which has only 256MB missing?

> Might be worth comparing against what Linux prints too.

Ok, I'll try to boot the Dom0 without the hypervisor to get some numbers
from native Linux for comparison.

As Keir suggested, this really might be a BIOS bug. Our hardware vendor
has notified the motherboard manufacturer to have this checked (this was
already the latest version).
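[Editorial note: plain arithmetic on the usable ranges in the two maps above quantifies the difference. The ranges below are copied from the e820 dumps in this message; the truncated-to-MB totals happen to match Xen's "System RAM" lines.]

```python
# Usable (start, end) ranges copied from the two e820 maps above.
broken  = [(0x0, 0x9fc00), (0x100000, 0xbfff0000), (0x100000000, 0x30e000000)]
working = [(0x0, 0x9fc00), (0x100000, 0xefff0000), (0x100000000, 0x300000000)]

def usable_mb(regions):
    # Total usable bytes, truncated to whole megabytes.
    return sum(end - start for start, end in regions) // (1024 * 1024)

print(usable_mb(broken))   # 11487, matching "System RAM: 11487MB"
print(usable_mb(working))  # 12031, matching "System RAM: 12031MB"
print(usable_mb(working) - usable_mb(broken))  # 544
```

So the remapping-on configuration exposes roughly 544MB less usable RAM than the remapping-off one, i.e. noticeably more than the ~256MB that the working map itself is missing.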
Petersson, Mats
2006-Oct-05 16:38 UTC
RE: [Xen-devel] Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of
> Christophe Saout
> Sent: 05 October 2006 15:19
> To: Ian Pratt
> Cc: xen-devel@lists.xensource.com
> Subject: RE: [Xen-devel] Kernel BUG at
> arch/x86_64/mm/../../i386/mm/hypervisor.c:197
>
> On Thursday, 2006-10-05 at 05:02 +0100, Ian Pratt wrote:
>
> > Glad it works for you, but I wish we understood what was going on a
> > bit more. It may be that the BIOS is just borked and the e820 map it
> > gives Xen misses some regions that it steals for other purposes. It
> > would be pretty surprising if Xen had bugs in its e820 code.
> >
> > It might be interesting to post the xm dmesg output with the two
> > different BIOS settings to see if there's anything unusual about the
> > e820 map.
>
> The only difference is in the Physical RAM map:
>
> Broken (with memory hole remapping turned on):
>
> (XEN) Physical RAM map:
> (XEN) 0000000000000000 - 000000000009fc00 (usable)
> (XEN) 000000000009fc00 - 00000000000a0000 (reserved)
> (XEN) 00000000000e8000 - 0000000000100000 (reserved)
> (XEN) 0000000000100000 - 00000000bfff0000 (usable)
> (XEN) 00000000bfff0000 - 00000000bffff000 (ACPI data)
> (XEN) 00000000bffff000 - 00000000c0000000 (ACPI NVS)

There is a HOLE here - c0000000 to ff780000 is "missing". That's 1GB
minus a little bit.

> (XEN) 00000000ff780000 - 0000000100000000 (reserved)
> (XEN) 0000000100000000 - 000000030e000000 (usable)
> (XEN) System RAM: 11487MB (11763260kB)
>
> Working (with memory hole remapping turned off):
>
> (XEN) Physical RAM map:
> (XEN) 0000000000000000 - 000000000009fc00 (usable)
> (XEN) 000000000009fc00 - 00000000000a0000 (reserved)
> (XEN) 00000000000e8000 - 0000000000100000 (reserved)
> (XEN) 0000000000100000 - 00000000efff0000 (usable)

This area is bigger, which probably explains the extra usable memory.

> (XEN) 00000000efff0000 - 00000000effff000 (ACPI data)
> (XEN) 00000000effff000 - 00000000f0000000 (ACPI NVS)

The hole here is much smaller - only f0000000 to ff780000, around 256MB,
if my mental arithmetic isn't playing up (which it does quite
frequently).

> (XEN) 00000000ff780000 - 0000000100000000 (reserved)
> (XEN) 0000000100000000 - 0000000300000000 (usable)
> (XEN) System RAM: 12031MB (12320316kB)
>
> The strange thing is that the first configuration shows even less
> memory than the second one, which has only 256MB missing?

That may be explained by the above comments - but I can't explain what's
going wrong in Xen with this...

-- Mats

> > Might be worth comparing against what Linux prints too.
>
> Ok, I'll try to boot the Dom0 without the hypervisor to get some
> numbers from native Linux for comparison.
>
> As Keir suggested, this really might be a BIOS bug. Our hardware
> vendor has notified the motherboard manufacturer to have this checked
> (this already was the latest version).
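[Editorial note: Mats's mental arithmetic checks out; the hole sizes fall straight out of the e820 boundaries quoted above.]

```python
# Gap between the end of the ACPI NVS region and the start of the
# reserved region just below 4GB, for each BIOS setting.
hole_remap_on  = 0xff780000 - 0xc0000000  # "broken" map
hole_remap_off = 0xff780000 - 0xf0000000  # "working" map

print(hole_remap_on / (1024.0 ** 2))   # 1015.5 -> "1GB minus a little bit"
print(hole_remap_off / (1024.0 ** 2))  # 247.5  -> "around 256MB"
```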