Hi folks,

I've been trying to get a new machine up and running with the latest Xen for a while on a Slackware64 (current) machine.
After installing Xen from source and building a new kernel with all xen options enabled I haven't been able to get the machine to behave.
The machine is a brand new dual opteron 6212 on a Supermicro H8DGi board with 64G ECC memory.

Running a stock slackware kernel without xen works like a charm, haven't seen anything weird.
However, as soon as I boot Xen with my custom kernel the machine panics within the hour.
When doing something intensive like building a kernel it'll often crash in a few minutes.
I've tried both Xen 4.3.0 and 4.3.1, no difference there.
The kernels I've tried were 3.11.4 and 3.11.6 and the brand new 3.12.

The kernel panics are a bit different every time, but the most common seems to be 'Bad page state in process X' or 'unable to handle kernel paging request at X' and of course 'general protection fault'.
Here's the most recent one:
-------------------
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034665] general protection fault: 0000 [#1] SMP
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034686] Modules linked in:
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034697] CPU: 0 PID: 262 Comm: jbd2/md0-8 Not tainted 3.12.0-Desman #1
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034707] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.0 09/10/2012
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034717] task: ffff8800d7162b20 ti: ffff8800d68b2000 task.ti: ffff8800d68b2000
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034726] RIP: e030:[<ffffffff8114119b>] [<ffffffff8114119b>] __rmqueue+0x6b/0x3a0
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034745] RSP: e02b:ffff8800d68b38b0 EFLAGS: 00010012
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034752] RAX: ffff8801281d9e08 RBX: 0000000000000000 RCX: 0000000000000003
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034797] RDX: 0000000000000001 RSI: ffff8801281d9f22 RDI: 9f30ffff8801281d
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034805] RBP: ffff8801281d9f02 R08: 0000000000000010 R09: 9f20ffff8801281d
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034814] R10: ffff8801281d9f10 R11: 0000000000000058 R12: ffff8801281d9d80
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034822] R13: 0000000000000001 R14: ffff8801281d9e00 R15: ffffea00046534e0
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034837] FS: 00007ff41d295740(0000) GS:ffff880122a00000(0000) knlGS:0000000000000000
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034847] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034854] CR2: 00007f4d5f308fb5 CR3: 000000011d045000 CR4: 0000000000040660
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034864] Stack:
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034869] 0000000000000000 ffff88011e4046c0 0000000000000001 0000000000001000
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034883] 0000000000000000 ffff8801281d9d80 0000000000000001 0000000000000000
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034897] 000000000000001f 0000000000000009 ffffea00046534e0 ffffffff81142f89
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034911] Call Trace:
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034922] [<ffffffff81142f89>] ? get_page_from_freelist+0x329/0x900
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034935] [<ffffffff811436b4>] ? __alloc_pages_nodemask+0x154/0xa90
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034948] [<ffffffff811539bd>] ? zone_statistics+0x9d/0xa0
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034961] [<ffffffff8117ddd3>] ? __kmalloc+0xe3/0x120
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034975] [<ffffffff81263d9a>] ? ext4_ext_find_extent+0x26a/0x300
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034987] [<ffffffff811749a5>] ? alloc_pages_current+0xb5/0x180
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.034999] [<ffffffff8117c3d5>] ? new_slab+0x255/0x2e0
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035011] [<ffffffff81d42807>] ? __slab_alloc+0x2a1/0x436
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035025] [<ffffffff811b8e18>] ? alloc_buffer_head+0x18/0x60
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035037] [<ffffffff8117db3b>] ? kmem_cache_alloc+0xab/0xd0
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035049] [<ffffffff811b8e18>] ? alloc_buffer_head+0x18/0x60
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035061] [<ffffffff811b8227>] ? generic_block_bmap+0x37/0x50
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035075] [<ffffffff81290676>] ? jbd2_journal_write_metadata_buffer+0x56/0x3c0
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035088] [<ffffffff8128aa41>] ? jbd2_journal_commit_transaction+0x721/0x16d0
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035103] [<ffffffff81007cbc>] ? xen_clocksource_read+0x1c/0x20
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035116] [<ffffffff81d4eed1>] ? _raw_spin_lock_irqsave+0x11/0x50
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035128] [<ffffffff8128e5df>] ? kjournald2+0xaf/0x240
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035140] [<ffffffff810cd2d0>] ? wake_up_atomic_t+0x30/0x30
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035152] [<ffffffff8128e530>] ? commit_timeout+0x10/0x10
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035163] [<ffffffff810cc57f>] ? kthread+0xaf/0xc0
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035174] [<ffffffff81007cbc>] ? xen_clocksource_read+0x1c/0x20
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035186] [<ffffffff810cc4d0>] ? kthread_create_on_node+0x120/0x120
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035197] [<ffffffff81d4facc>] ? ret_from_fork+0x7c/0xb0
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035208] [<ffffffff810cc4d0>] ? kthread_create_on_node+0x120/0x120
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035215] Code: 89 c2 89 d9 4c 89 c7 48 c1 e7 04 48 8d 34 38 48 3b 36 0f 84 b8 00 00 00 49 c1 e0 04 4b 8b 34 02 48 8b 7e 08 4c 8b 0e 48 8d 6e e0 <49> 89 79 08 4c 89 0f 48 bf 00 01 10 00 00 00 ad de 48 89 3e 48
Nov 5 13:44:20 192.168.1.6 kernel 01 [kern.alert] kernel: [ 6868.035326] RIP [<ffffffff8114119b>] __rmqueue+0x6b/0x3a0
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.035336] RSP <ffff8800d68b38b0>
Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: [ 6868.049110] ---[ end trace 65f94d10957f59d0 ]---
-----------------------

I've attached my kernel configuration to this email for those who care :)

Does anyone have any idea what I'm facing here?
If it weren't for the stock kernel (without Xen) running stable I'd guess bad memory, but so far a memory test gave 0 errors (not that that's a real indication).
Feels like a bug / config problem somehow.

Thanks for reading :)

Regards,

Wouter.
_______________________________________________ Xen-users mailing list Xen-users@lists.xen.org http://lists.xen.org/xen-users
I've been experimenting some more.
Last 24 hours I've been constantly compiling (in a while loop) using my (non-Xen) stock slackware kernel 3.10.7, stable as a rock.

Just booted Xen 4.3.1 with my custom 3.11 kernel, crashed as soon as I did a rm -rf on some old sources.
Here's the console output:
------
(XEN) ----[ Xen-4.3.1 x86_64 debug=n Not tainted ]----
(XEN) CPU: 4
(XEN) RIP: e008:[<ffff82c4c013f47c>] do_dbs_timer+0x11c/0x240
(XEN) RFLAGS: 0000000000010286 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: 000000003b9d8704 rcx: 000000000000001d
(XEN) rdx: 0000000000000000 rsi: 0000000000000000 rdi: 0000000000000000
(XEN) rbp: ffff830834fd6380 rsp: ffff830834fffe30 r8: 00000012d91afd3e
(XEN) r9: ffff830834ff7128 r10: 0000000000000000 r11: 0000000000000000
(XEN) r12: 0000000000000000 r13: ffff830977948860 r14: 8000000000000380
(XEN) r15: 000000000000001d cr0: 000000008005003b cr4: 00000000000406f0
(XEN) cr3: 00000000d7c5f000 cr2: 0000000000000000
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) Xen stack trace from rsp=ffff830834fffe30:
(XEN) 0000000000000286 ffff82c4c02ea940 ffff82c4c0300980 0027ac4021424b00
(XEN) 000000fb00000000 ffff831021424d00 ffff831021424d50 00000012d91b237a
(XEN) 0000000000000004 0000000000000000 0000000000000000 ffff82c4c019bbf1
(XEN) 00000000ffffffff ffff82c4c02c7800 0014e1920000200d 0000000000000000
(XEN) 0000000000000000 00000000ffffffff ffff82c4c02c7800 ffff82c4c01245f4
(XEN) 000000000000e008 ffff830834ff8000 ffff830834ff8000 0000000000000004
(XEN) 0000000000000004 ffff82c4c01584ceffff8300d7afc000
(XEN) 0000004374cd5a00 0000000000000000
(XEN) Xen call trace:
(XEN) [<ffff82c4c013f47c>] do_dbs_timer+0x11c/0x240
(XEN) [<ffff82c4c019bbf1>] acpi_processor_idle+0x201/0x550
(XEN) [<ffff82c4c01245f4>] __do_softirq+0x74/0xa0
(XEN) [<ffff82c4c01584ce>] idle_loop+0x1e/0x50
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 4:
(XEN) GENERAL PROTECTION FAULT
(XEN) [error_code=0000]
(XEN) ****************************************
(XEN)
(XEN) Manual reset required ('noreboot' specified)
------

Suggestions anyone? :)

Regards,

Wouter.
On Tue, 2013-11-05 at 14:19 +0100, Wouter de Geus wrote:
> I've been trying to get a new machine up and running with the latest
> Xen for a while on a Slackware64 (current) machine.
> After installing Xen from source and building a new kernel with all
> xen options enabled I haven't been able to get the machine to behave.
> The machine is a brand new dual opteron 6212 on a Supermicro H8DGi
> board with 64G ECC memory.
>
> Running a stock slackware kernel without xen works like a charm,
> haven't seen anything weird.
> However, as soon as I boot Xen with my custom kernel the machine
> panics within the hour.
> When doing something intensive like building a kernel it'll often
> crash in a few minutes.
> I've tried both Xen 4.3.0 and 4.3.1, no difference there.
> The kernels I've tried were 3.11.4 and 3.11.6 and the brand new 3.12.
>
> The kernel panics are a bit different every time, but the most common
> seems to be 'Bad page state in process X' or 'unable to handle kernel
> paging request at X' and of course 'general protection fault'.
> Here's the most recent one:

(trimmed the long common prefix so it's not wrapped and therefore readable)

> [ 6868.034665] general protection fault: 0000 [#1] SMP
> [ 6868.034686] Modules linked in:
> [ 6868.034697] CPU: 0 PID: 262 Comm: jbd2/md0-8 Not tainted 3.12.0-Desman #1
> [ 6868.034707] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.0 09/10/2012
> [ 6868.034717] task: ffff8800d7162b20 ti: ffff8800d68b2000 task.ti: ffff8800d68b2000
> [ 6868.034726] RIP: e030:[<ffffffff8114119b>] [<ffffffff8114119b>] __rmqueue+0x6b/0x3a0
> [ 6868.034745] RSP: e02b:ffff8800d68b38b0 EFLAGS: 00010012
> [ 6868.034752] RAX: ffff8801281d9e08 RBX: 0000000000000000 RCX: 0000000000000003
> [ 6868.034797] RDX: 0000000000000001 RSI: ffff8801281d9f22 RDI: 9f30ffff8801281d
> [ 6868.034805] RBP: ffff8801281d9f02 R08: 0000000000000010 R09: 9f20ffff8801281d
> [ 6868.034814] R10: ffff8801281d9f10 R11: 0000000000000058 R12: ffff8801281d9d80
> [ 6868.034822] R13: 0000000000000001 R14: ffff8801281d9e00 R15: ffffea00046534e0
> [ 6868.034837] FS: 00007ff41d295740(0000) GS:ffff880122a00000(0000) knlGS:0000000000000000
> [ 6868.034847] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 6868.034854] CR2: 00007f4d5f308fb5 CR3: 000000011d045000 CR4: 0000000000040660
> [ 6868.034864] Stack:
> [ 6868.034869] 0000000000000000 ffff88011e4046c0 0000000000000001 0000000000001000
> [ 6868.034883] 0000000000000000 ffff8801281d9d80 0000000000000001 0000000000000000
> [ 6868.034897] 000000000000001f 0000000000000009 ffffea00046534e0 ffffffff81142f89
> [ 6868.034911] Call Trace:
> [ 6868.034922] [<ffffffff81142f89>] ? get_page_from_freelist+0x329/0x900
> [ 6868.034935] [<ffffffff811436b4>] ? __alloc_pages_nodemask+0x154/0xa90
> [ 6868.034948] [<ffffffff811539bd>] ? zone_statistics+0x9d/0xa0
> [ 6868.034961] [<ffffffff8117ddd3>] ? __kmalloc+0xe3/0x120
> [ 6868.034975] [<ffffffff81263d9a>] ? ext4_ext_find_extent+0x26a/0x300
> [ 6868.034987] [<ffffffff811749a5>] ? alloc_pages_current+0xb5/0x180
> [ 6868.034999] [<ffffffff8117c3d5>] ? new_slab+0x255/0x2e0
> [ 6868.035011] [<ffffffff81d42807>] ? __slab_alloc+0x2a1/0x436
> [ 6868.035025] [<ffffffff811b8e18>] ? alloc_buffer_head+0x18/0x60
> [ 6868.035037] [<ffffffff8117db3b>] ? kmem_cache_alloc+0xab/0xd0
> [ 6868.035049] [<ffffffff811b8e18>] ? alloc_buffer_head+0x18/0x60
> [ 6868.035061] [<ffffffff811b8227>] ? generic_block_bmap+0x37/0x50
> [ 6868.035075] [<ffffffff81290676>] ? jbd2_journal_write_metadata_buffer+0x56/0x3c0
> [ 6868.035088] [<ffffffff8128aa41>] ? jbd2_journal_commit_transaction+0x721/0x16d0
> [ 6868.035103] [<ffffffff81007cbc>] ? xen_clocksource_read+0x1c/0x20
> [ 6868.035116] [<ffffffff81d4eed1>] ? _raw_spin_lock_irqsave+0x11/0x50
> [ 6868.035128] [<ffffffff8128e5df>] ? kjournald2+0xaf/0x240
> [ 6868.035140] [<ffffffff810cd2d0>] ? wake_up_atomic_t+0x30/0x30
> [ 6868.035152] [<ffffffff8128e530>] ? commit_timeout+0x10/0x10
> [ 6868.035163] [<ffffffff810cc57f>] ? kthread+0xaf/0xc0
> [ 6868.035174] [<ffffffff81007cbc>] ? xen_clocksource_read+0x1c/0x20
> [ 6868.035186] [<ffffffff810cc4d0>] ? kthread_create_on_node+0x120/0x120
> [ 6868.035197] [<ffffffff81d4facc>] ? ret_from_fork+0x7c/0xb0
> [ 6868.035208] [<ffffffff810cc4d0>] ? kthread_create_on_node+0x120/0x120
> [ 6868.035215] Code: 89 c2 89 d9 4c 89 c7 48 c1 e7 04 48 8d 34 38 48 3b 36 0f 84 b8 00 00 00 49 c1 e0 04 4b 8b 34 02 48 8b 7e 08 4c 8b 0e 48 8d 6e e0 <49> 89 79 08 4c 89 0f 48 bf 00 01 10 00 00 00 ad de 48 89 3e 48
> [ 6868.035326] RIP [<ffffffff8114119b>] __rmqueue+0x6b/0x3a0
> [ 6868.035336] RSP <ffff8800d68b38b0>
> [ 6868.049110] ---[ end trace 65f94d10957f59d0 ]---

> If it weren't for the stock kernel (without Xen) running stable

Have you run your own kernel (3.12.0-Desman, the one which crashes with Xen) without Xen underneath?

> I'd guess bad memory, but so far a memory test gave 0 errors (not that
> that's a real indication).
> Feels like a bug / config problem somehow.

Yes, I agree.

I'm afraid I've not seen anything like this, CCing the Xen pvops maintainers for input.

Ian.
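For anyone who wants to trim the remote-syslog prefix from such logs the same way before posting, here is a minimal sketch in Python. It is not what Ian used; the function name `strip_syslog_prefix` is made up for illustration, and the prefix shape is only assumed from the log lines in the first message of this thread.

```python
import re

# Assumption: the prefix looks like the lines in the first message, e.g.
# "Nov 5 13:44:20 192.168.1.6 kernel 04 [kern.warning] kernel: "
# Everything up to (not including) the kernel timestamp "[ 6868.034665]" is dropped.
SYSLOG_PREFIX = re.compile(r"^.*?\bkernel: (?=\[\s*\d+\.\d+\])")


def strip_syslog_prefix(line):
    """Return the line without the syslog prefix; lines without it are unchanged."""
    return SYSLOG_PREFIX.sub("", line, count=1)
```

Applied line by line to a saved log (for example while iterating over an open file), this leaves only the kernel timestamp and message, which keeps each oops line short enough not to wrap in a mail client.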
(CCing Linux guys, not because this involves Linux but because I CCed them on the previous mail)

On Wed, 2013-11-06 at 10:12 +0100, Wouter de Geus wrote:
> I've been experimenting some more.
> Last 24 hours I've been constantly compiling (in a while loop) using my (non-Xen) stock slackware kernel 3.10.7, stable as a rock.
>
> Just booted Xen 4.3.1 with my custom 3.11 kernel, crashed as soon as I did a rm -rf on some old sources.
> Here's the console output:
> ------
> (XEN) ----[ Xen-4.3.1 x86_64 debug=n Not tainted ]----
> (XEN) CPU: 4
> (XEN) RIP: e008:[<ffff82c4c013f47c>] do_dbs_timer+0x11c/0x240

This is a cpufreq thing from the looks of it.

cpufreq differences between native Linux and Xen could cause weird memory corruption, manifesting as a variety of page faults, GPFs etc, I guess.

Perhaps investigate disabling cpufreq stuff under Xen? I'm not sure how one does this exactly but google threw up http://wiki.xen.org/wiki/Xen_power_management and I saw some references in http://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

Ian.

> (XEN) RFLAGS: 0000000000010286 CONTEXT: hypervisor
> (XEN) rax: 0000000000000000 rbx: 000000003b9d8704 rcx: 000000000000001d
> (XEN) rdx: 0000000000000000 rsi: 0000000000000000 rdi: 0000000000000000
> (XEN) rbp: ffff830834fd6380 rsp: ffff830834fffe30 r8: 00000012d91afd3e
> (XEN) r9: ffff830834ff7128 r10: 0000000000000000 r11: 0000000000000000
> (XEN) r12: 0000000000000000 r13: ffff830977948860 r14: 8000000000000380
> (XEN) r15: 000000000000001d cr0: 000000008005003b cr4: 00000000000406f0
> (XEN) cr3: 00000000d7c5f000 cr2: 0000000000000000
> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
> (XEN) Xen stack trace from rsp=ffff830834fffe30:
> (XEN) 0000000000000286 ffff82c4c02ea940 ffff82c4c0300980 0027ac4021424b00
> (XEN) 000000fb00000000 ffff831021424d00 ffff831021424d50 00000012d91b237a
> (XEN) 0000000000000004 0000000000000000 0000000000000000 ffff82c4c019bbf1
> (XEN) 00000000ffffffff ffff82c4c02c7800 0014e1920000200d 0000000000000000
> (XEN) 0000000000000000 00000000ffffffff ffff82c4c02c7800 ffff82c4c01245f4
> (XEN) 000000000000e008 ffff830834ff8000 ffff830834ff8000 0000000000000004
> (XEN) 0000000000000004 ffff82c4c01584ceffff8300d7afc000
> (XEN) 0000004374cd5a00 0000000000000000
> (XEN) Xen call trace:
> (XEN) [<ffff82c4c013f47c>] do_dbs_timer+0x11c/0x240
> (XEN) [<ffff82c4c019bbf1>] acpi_processor_idle+0x201/0x550
> (XEN) [<ffff82c4c01245f4>] __do_softirq+0x74/0xa0
> (XEN) [<ffff82c4c01584ce>] idle_loop+0x1e/0x50
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 4:
> (XEN) GENERAL PROTECTION FAULT
> (XEN) [error_code=0000]
> (XEN) ****************************************
> (XEN)
> (XEN) Manual reset required ('noreboot' specified)
> ------
>
> Suggestions anyone? :)
>
> Regards,
>
> Wouter.
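The diagnosis above starts from the symbol named on the `RIP:` line of the hypervisor panic (`do_dbs_timer`). As a rough sketch of how that symbol and the call trace could be pulled out of a saved console log, assuming the `(XEN)` line format shown in this thread (the helper names `faulting_symbol` and `call_trace_symbols` are hypothetical):

```python
import re

# Matches e.g. "(XEN) RIP: e008:[<ffff82c4c013f47c>] do_dbs_timer+0x11c/0x240"
RIP_LINE = re.compile(r"\(XEN\)\s+RIP:\s+\S+\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Matches call trace entries such as
# "(XEN) [<ffff82c4c019bbf1>] acpi_processor_idle+0x201/0x550"
TRACE_LINE = re.compile(r"\(XEN\)\s+\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def faulting_symbol(console_text):
    """Return the function name from the first RIP line, or None if there is none."""
    match = RIP_LINE.search(console_text)
    return match.group(1) if match else None


def call_trace_symbols(console_text):
    """Return the function names from the Xen call trace, outermost frame last."""
    return TRACE_LINE.findall(console_text)
```

This only extracts names; deciding which subsystem a symbol belongs to still means looking it up in the Xen source tree.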
* Ian Campbell <Ian.Campbell@citrix.com> [2013-11-06 09:41:39 +0000]:
> This is a cpufreq thing from the looks of it.
>
> cpufreq differences between native Linux and Xen could cause weird
> memory corruption, manifesting as a variety of page faults, GPFs etc, I
> guess.
>
> Perhaps investigate disabling cpufreq stuff under Xen? I'm not sure how
> one does this exactly but google threw up
> http://wiki.xen.org/wiki/Xen_power_management and I saw some references
> in http://xenbits.xen.org/docs/unstable/misc/xen-command-line.html
>
> Ian.

Thanks a lot for the insight!

I've booted my 3.12-Desman kernel under xen with 'cpufreq=none' on the xen commandline. So far so good (trying some kernel compiles to see if it's stable, system has been up for 20 minutes now).
If this turns out to be stable I'll try again with cpufreq=dom0 to see if that's also stable. I'll report my findings if you care.

If there's anything you guys want me to test please let me know.

Thanks again!

Wouter.
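When switching between `cpufreq=none` and `cpufreq=dom0` across reboots, it can help to confirm which value the running hypervisor was actually booted with. The sketch below parses text in the `key : value` layout that `xl info` prints and looks for a `xen_commandline` entry; both the field name and the helper name `cpufreq_setting` are assumptions here, not something confirmed in this thread.

```python
def cpufreq_setting(xl_info_text):
    """Return the value of cpufreq= from the Xen command line, or None if absent.

    Assumes `xl info`-style lines such as:
        xen_commandline        : placeholder dom0_mem=6144M cpufreq=none
    """
    for line in xl_info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "xen_commandline":
            for token in value.split():
                if token.startswith("cpufreq="):
                    return token[len("cpufreq="):]
    return None
```

In dom0 the input could come from something like `subprocess.run(["xl", "info"], capture_output=True, text=True).stdout`; a result of `None` would mean no `cpufreq=` option was passed and the hypervisor default applies.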
On Wed, 2013-11-06 at 11:20 +0100, Wouter de Geus wrote:
> * Ian Campbell <Ian.Campbell@citrix.com> [2013-11-06 09:41:39 +0000]:
>
> > This is a cpufreq thing from the looks of it.
> >
> > cpufreq differences between native Linux and Xen could cause weird
> > memory corruption, manifesting as a variety of page faults, GPFs etc, I
> > guess.
> >
> > Perhaps investigate disabling cpufreq stuff under Xen? I'm not sure how
> > one does this exactly but google threw up
> > http://wiki.xen.org/wiki/Xen_power_management and I saw some references
> > in http://xenbits.xen.org/docs/unstable/misc/xen-command-line.html
> >
> > Ian.
>
> Thanks a lot for the insight!
>
> I've booted my 3.12-Desman kernel under xen with 'cpufreq=none' on the xen
> commandline. So far so good (trying some kernel compiles to see if it's
> stable, system has been up for 20 minutes now).
> If this turns out to be stable I'll try again with cpufreq=dom0 to see if
> that's also stable. I'll report my findings if you care.

Please do.

I suspect it shouldn't be necessary to use command lines to override these things, but I've no idea how to diagnose this further.

Once you have the findings if you could post a summary to xen-devel and CC jbeulich@suse.com & insong.liu@intel.com (cpufreq/power mgmt maintainers) perhaps they can advise.

Ian.
* Ian Campbell <Ian.Campbell@citrix.com> [2013-11-06 10:51:07 +0000]:
> > If this turns out to be stable I'll try again with cpufreq=dom0 to see if
> > that's also stable. I'll report my findings if you care.
>
> Please do.

With cpufreq=none I've been able to run through a windows 2008 installation and some kernel compiles without problems. After that I rebooted with cpufreq=dom0, and within 5 minutes ran into the first oops again:

[ 428.105061] BUG: unable to handle kernel paging request at ffffea0000dd8a48
[ 428.105103] IP: [<ffffffff8115c126>] unmap_single_vma+0x426/0x820
[ 428.105115] PGD 1281d6067 PUD 1281d5067 PMD 1281ce067 PTE 801000097bf53068
[ 428.105123] Oops: 0000 [#1] SMP
[ 428.105127] Modules linked in:
[ 428.105133] CPU: 3 PID: 1786 Comm: sh Not tainted 3.12.0-Desman #32
[ 428.105138] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.0 09/10/2012
[ 428.105142] task: ffff88011dbb1590 ti: ffff8800d5088000 task.ti: ffff8800d5088000
[ 428.105147] RIP: e030:[<ffffffff8115c126>] [<ffffffff8115c126>] unmap_single_vma+0x426/0x820
[ 428.105154] RSP: e02b:ffff8800d5089d30 EFLAGS: 00010246
[ 428.105157] RAX: 80000008002db165 RBX: ffff8800d2ad0d60 RCX: 0000000000dd8a40
[ 428.105161] RDX: 80000008002db165 RSI: 0000000001fac000 RDI: 80000008002db165
[ 428.105165] RBP: ffffea0000dd8a40 R08: ffff8800d2b52cf0 R09: 00000000fffffffa
[ 428.105169] R10: 0000000000000a6f R11: 00000063ad0a7abc R12: 0000000001fe5000
[ 428.105173] R13: ffffc00000000fff R14: 0000000001fac000 R15: ffff8800d5089e40
[ 428.105181] FS: 00002b839c48c600(0000) GS:ffff880122a60000(0000) knlGS:0000000000000000
[ 428.105186] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 428.105215] CR2: ffffea0000dd8a48 CR3: 00000000021de000 CR4: 0000000000040660
[ 428.105220] Stack:
[ 428.105222] ffff8800d6961c00 0000000000000000 ffff8800d2b52cf0 0000000000000000
[ 428.105229] ffffea00034ab430 80000008002db165 ffff8800c331c078 0000000001fe5000
[ 428.105236] ffff880000000000 00003ffffffff000 ffff88011dbb1590 0000000001fe4fff
[ 428.105242] Call Trace:
[ 428.105248] [<ffffffff8115d4c1>] ? unmap_vmas+0x41/0x90
[ 428.105254] [<ffffffff81165e1a>] ? exit_mmap+0x8a/0x150
[ 428.105261] [<ffffffff810abc19>] ? mmput+0x49/0x100
[ 428.105267] [<ffffffff810afb53>] ? do_exit+0x273/0xa30
[ 428.105273] [<ffffffff810dc045>] ? vtime_account_user+0x45/0x60
[ 428.105278] [<ffffffff810b10d4>] ? do_group_exit+0x34/0xa0
[ 428.105284] [<ffffffff810b114b>] ? SyS_exit_group+0xb/0x10
[ 428.105290] [<ffffffff81d4fd8f>] ? tracesys+0xe1/0xe6
[ 428.105294] Code: 48 8b 3c 24 4c 89 f6 48 89 da 66 66 66 90 66 66 90 41 80 4f 18 01 48 85 ed 0f 84 7a ff ff ff 48 83 7c 24 18 00 0f 85 02 03 00 00 <f6> 45 08 01 0f 84 70 01 00 00 48 89 ef ff 8c 24 98 00 00 00 e8
[ 428.105347] RIP [<ffffffff8115c126>] unmap_single_vma+0x426/0x820
[ 428.105353] RSP <ffff8800d5089d30>
[ 428.105356] CR2: ffffea0000dd8a48
[ 428.105360] ---[ end trace 81935aa1c6524ae3 ]---

> I suspect it shouldn't be necessary to use command lines to override
> these things, but I've no idea how to diagnose this further.

Removing the entire cpufreq part from my dom0 kernel might help :)
But then again, if that's the problem I would like the hypervisor to detect and avoid it if possible.

> Once you have the findings if you could post a summary to xen-devel and
> CC jbeulich@suse.com & insong.liu@intel.com (cpufreq/power mgmt
> maintainers) perhaps they can advise.

Summary:
--------
The issue: Xen 4.3.1 with my Linux 3.12 build (with cpufreq) panics (paging request failures, GPF, bad page state), usually within a few minutes.
When Xen is booted with cpufreq=none the problem seems to disappear, with cpufreq=dom0 the problem is still there.
The machine I run this on is a dual opteron 6212 with 64GB ECC RAM on a Supermicro H8DGi board.

Regards,

Wouter.
Hello,

I currently use (in grub.cfg):
multiboot /xen-4.2-amd64.gz placeholder dom0_mem=6144M cpufreq=xen cpuidle vtd=1 iommu=1 loop.max_loop=64
with an AMD processor and it works without any kind of kernel bugs.

Regards

JP P

----- Original message -----
From: "Wouter de Geus" <benv-xensource.com@junerules.com>
To: xen-users@lists.xen.org
Cc: "insong liu" <insong.liu@intel.com>, jbeulich@suse.com, xen-devel@lists.xen.org
Sent: Wednesday 6 November 2013 14:25:28
Subject: Re: [Xen-users] Xen 4.3.1 / Linux 3.12 panic
* Jean-Paul Pozzi <jpp@jppozzi.dyndns.org> [2013-11-06 14:49:52 +0100]:
> Hello,

Hello,

> I use currently (in grub.cfg) :
> multiboot /xen-4.2-amd64.gz placeholder dom0_mem=6144M cpufreq=xen cpuidle vtd=1 iommu=1 loop.max_loop=64
> with an AMD processor and it works without any kind of kernel bugs.

Yeah, well, I have 4 more machines in the datacenter running Xen 4.1 and Xen 4.2 with AMD processors, even one that has an Opteron processor (but not the same model).
However, this new machine has the problem I mentioned before... New hardware, new problems :)
And I would put this down to a hardware problem, were it not that the kernel without Xen works flawlessly (and no memtest errors either).

Regards,

Wouter.
On Wed, Nov 06, 2013 at 02:25:28PM +0100, Wouter de Geus wrote:
> * Ian Campbell <Ian.Campbell@citrix.com> [2013-11-06 10:51:07 +0000]:
>
> > > If this turns out to be stable I'll try again with cpufreq=dom0 to see if
> > > that's also stable. I'll report my findings if you care.
> >
> > Please do.
>
> With cpufreq=none I've been able to run through a windows 2008 installation
> and some kernel compiles without problems. After that I rebooted with
> cpufreq=dom0, and within 5 minutes ran into the first oops again:

Is there a particular reason you tried 'cpufreq'? Sorry if that was answered earlier.

>
> [ 428.105061] BUG: unable to handle kernel paging request at ffffea0000dd8a48
> [ 428.105103] IP: [<ffffffff8115c126>] unmap_single_vma+0x426/0x820
> [ 428.105115] PGD 1281d6067 PUD 1281d5067 PMD 1281ce067 PTE 801000097bf53068
> [ 428.105123] Oops: 0000 [#1] SMP
> [ 428.105127] Modules linked in:
> [ 428.105133] CPU: 3 PID: 1786 Comm: sh Not tainted 3.12.0-Desman #32
> [ 428.105138] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.0 09/10/2012
> [ 428.105142] task: ffff88011dbb1590 ti: ffff8800d5088000 task.ti: ffff8800d5088000
> [ 428.105147] RIP: e030:[<ffffffff8115c126>] [<ffffffff8115c126>] unmap_single_vma+0x426/0x820
> [ 428.105154] RSP: e02b:ffff8800d5089d30 EFLAGS: 00010246
> [ 428.105157] RAX: 80000008002db165 RBX: ffff8800d2ad0d60 RCX: 0000000000dd8a40
> [ 428.105161] RDX: 80000008002db165 RSI: 0000000001fac000 RDI: 80000008002db165
> [ 428.105165] RBP: ffffea0000dd8a40 R08: ffff8800d2b52cf0 R09: 00000000fffffffa
> [ 428.105169] R10: 0000000000000a6f R11: 00000063ad0a7abc R12: 0000000001fe5000
> [ 428.105173] R13: ffffc00000000fff R14: 0000000001fac000 R15: ffff8800d5089e40
> [ 428.105181] FS: 00002b839c48c600(0000) GS:ffff880122a60000(0000) knlGS:0000000000000000
> [ 428.105186] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 428.105215] CR2: ffffea0000dd8a48 CR3: 00000000021de000 CR4: 0000000000040660
> [ 428.105220] Stack:
> [ 428.105222] ffff8800d6961c00 0000000000000000 ffff8800d2b52cf0 0000000000000000
> [ 428.105229] ffffea00034ab430 80000008002db165 ffff8800c331c078 0000000001fe5000
> [ 428.105236] ffff880000000000 00003ffffffff000 ffff88011dbb1590 0000000001fe4fff
> [ 428.105242] Call Trace:
> [ 428.105248] [<ffffffff8115d4c1>] ? unmap_vmas+0x41/0x90
> [ 428.105254] [<ffffffff81165e1a>] ? exit_mmap+0x8a/0x150
> [ 428.105261] [<ffffffff810abc19>] ? mmput+0x49/0x100
> [ 428.105267] [<ffffffff810afb53>] ? do_exit+0x273/0xa30
> [ 428.105273] [<ffffffff810dc045>] ? vtime_account_user+0x45/0x60
> [ 428.105278] [<ffffffff810b10d4>] ? do_group_exit+0x34/0xa0
> [ 428.105284] [<ffffffff810b114b>] ? SyS_exit_group+0xb/0x10
> [ 428.105290] [<ffffffff81d4fd8f>] ? tracesys+0xe1/0xe6
> [ 428.105294] Code: 48 8b 3c 24 4c 89 f6 48 89 da 66 66 66 90 66 66 90 41 80 4f 18 01 48 85 ed 0f 84 7a ff ff ff 48 83 7c 24 18 00 0f 85 02 03 00 00 <f6> 45 08 01 0f 84 70 01 00 00 48 89 ef ff 8c 24 98 00 00 00 e8
> [ 428.105347] RIP [<ffffffff8115c126>] unmap_single_vma+0x426/0x820
> [ 428.105353] RSP <ffff8800d5089d30>
> [ 428.105356] CR2: ffffea0000dd8a48
> [ 428.105360] ---[ end trace 81935aa1c6524ae3 ]---
>
> > I suspect it shouldn't be necessary to use command lines to override
> > these things, but I've no idea how to diagnose this further.
>
> Removing the entire cpufreq part from my dom0 kernel might help :)
> But then again, if that's a problem I would like the hypervisor to detect
> and avoid this problem if that's possible.

So cpufreq=dom0 is kind of a no-op, as the Linux kernel will disable the native CPUfreq machinery. This is done because it does not make sense for a Linux dom0 to control the CPU frequency when it has no idea of the workloads (the hypervisor does).

But with 'cpufreq=dom0' you are getting faults.

So the other question is - does anything happen if you disable ACPI power states in the BIOS?

>
> > Once you have the findings if you could post a summary to xen-devel and
> > CC jbeulich@suse.com & insong.liu@intel.com (cpufreq/power mgmt
> > maintainers) perhaps they can advise.
>
> Summary:
> --------
> The issue: Xen 4.3.1 and my Linux 3.12 build (with cpufreq) panics (page
> requests, GPF, bad page state) usually within a few minutes.
> When Xen is booted with cpufreq=none the problem seems to disappear, with
> cpufreq=dom0 the problem is still there.
> The machine I run this on is a dual opteron 6212 with 64GB ECC RAM on a
> Supermicro H8DGi board.
>
> Regards,
>
> Wouter.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
On Wed, Nov 06, 2013 at 09:41:39AM +0000, Ian Campbell wrote:
> (CCing Linux guys, not because this involves Linux but because I CCed
> them on the previous mail)
>
> On Wed, 2013-11-06 at 10:12 +0100, Wouter de Geus wrote:
> > I've been experimenting some more.
> > Last 24 hours I've been constantly compiling (in a while loop) using my (non-Xen) stock slackware kernel 3.10.7, stable as a rock.
> >
> > Just booted Xen 4.3.1 with my custom 3.11 kernel, crashed as soon as I did a rm -rf on some old sources.
> > Here's the console output:
> > ------
> > (XEN) ----[ Xen-4.3.1 x86_64 debug=n Not tainted ]----
> > (XEN) CPU: 4
> > (XEN) RIP: e008:[<ffff82c4c013f47c>] do_dbs_timer+0x11c/0x240
>
> This is a cpufreq thing from the looks of it.
>
> cpufreq differences between native Linux and Xen could cause weird
> memory corruption, manifesting as a variety of page faults, GPFs etc, I
> guess.
>
> Perhaps investigate disabling cpufreq stuff under Xen? I'm not sure how
> one does this exactly but Google threw up
> http://wiki.xen.org/wiki/Xen_power_management and I saw some references
> in http://xenbits.xen.org/docs/unstable/misc/xen-command-line.html
>
> Ian.
> >
> > (XEN) RFLAGS: 0000000000010286 CONTEXT: hypervisor
> > (XEN) rax: 0000000000000000 rbx: 000000003b9d8704 rcx: 000000000000001d
> > (XEN) rdx: 0000000000000000 rsi: 0000000000000000 rdi: 0000000000000000
> > (XEN) rbp: ffff830834fd6380 rsp: ffff830834fffe30 r8: 00000012d91afd3e
> > (XEN) r9: ffff830834ff7128 r10: 0000000000000000 r11: 0000000000000000
> > (XEN) r12: 0000000000000000 r13: ffff830977948860 r14: 8000000000000380
> > (XEN) r15: 000000000000001d cr0: 000000008005003b cr4: 00000000000406f0
> > (XEN) cr3: 00000000d7c5f000 cr2: 0000000000000000
> > (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
> > (XEN) Xen stack trace from rsp=ffff830834fffe30:
> > (XEN) 0000000000000286 ffff82c4c02ea940 ffff82c4c0300980 0027ac4021424b00
> > (XEN) 000000fb00000000 ffff831021424d00 ffff831021424d50 00000012d91b237a
> > (XEN) 0000000000000004 0000000000000000 0000000000000000 ffff82c4c019bbf1
> > (XEN) 00000000ffffffff ffff82c4c02c7800 0014e1920000200d 0000000000000000
> > (XEN) 0000000000000000 00000000ffffffff ffff82c4c02c7800 ffff82c4c01245f4
> > (XEN) 000000000000e008 ffff830834ff8000 ffff830834ff8000 0000000000000004
> > (XEN) 0000000000000004 ffff82c4c01584ce 0000000000000000 0000000000000000
> > (XEN) 0000000000000001 0000000000000000 0000000000000000 0000000000000000
> > (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN) 0000000000000000 0000000000000000 0000000000000004 ffff8300d7afc000
> > (XEN) 0000004374cd5a00 0000000000000000
> > (XEN) Xen call trace:
> > (XEN) [<ffff82c4c013f47c>] do_dbs_timer+0x11c/0x240
> > (XEN) [<ffff82c4c019bbf1>] acpi_processor_idle+0x201/0x550
> > (XEN) [<ffff82c4c01245f4>] __do_softirq+0x74/0xa0
> > (XEN) [<ffff82c4c01584ce>] idle_loop+0x1e/0x50

That is just impressive. I see a bunch of computations that it might be doing. But I can't reproduce it with Xen 4.4 on an AMD box.

Could you send the full serial log? I am curious what your config options are.

And when does it happen? Is there a specific workload you are running?

Thanks!

> > (XEN)
> > (XEN)
> > (XEN) ****************************************
> > (XEN) Panic on CPU 4:
> > (XEN) GENERAL PROTECTION FAULT
> > (XEN) [error_code=0000]
> > (XEN) ****************************************
> > (XEN)
> > (XEN) Manual reset required ('noreboot' specified)
> > ------
> >
> > Suggestions anyone? :)
> >
> > Regards,
> >
> > Wouter.
* Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [2013-11-06 15:44:58 -0500]:
> Is there a particular reason you had tried 'cpufreq'? Sorry if that
> was answered earlier?

Ian suggested that the problem might be cpufreq related (see http://lists.xenproject.org/archives/html/xen-users/2013-11/msg00053.html).

> So the cpufreq=dom0 is kind of a no-op as the Linux kernel will disable
> the native CPUfreq machinery. This is done b/c it does not make sense
> for Linux dom0 to control the CPU freq when it has no idea of the
> workloads (the hypervisor has it).

Aha. Not exactly what I understood from the Xen documentation (http://wiki.xen.org/wiki/Xen_power_management), but I was just testing it to see if it would be stable.

> But with the 'cpufreq=dom0' you are getting faults.

With cpufreq=dom0, and also with no cpufreq option at all (which defaults to cpufreq=xen according to the docs), the system will crash within the hour.
With cpufreq=none the system has now been stable for over a day without any kernel warnings whatsoever (and I tried compiling a kernel for some load).

> So the other question is - does anything happen if you disable ACPI power
> states in the BIOS?

Let's try ;)

The BIOS has the following options that I consider relevant:
Name [Current] (Options)
- PowerNow [Enabled]
- C State Mode [C6] (Disabled, C6)
- PowerCap [P-state 0] (P-state 0 through 4)
- HPC Mode [Enabled] (Disabled, Enabled)
- CPB Mode [Auto] (Disabled, Auto)
- C1E Support [Enabled] (Enabled, Disabled)

When I set PowerNow to disabled, the C State Mode, PowerCap and HPC options also disappear.
After booting with PowerNow disabled (without the cpufreq option) I tried a kernel compile twice and some heavy I/O, under which the system was stable.
So that seems to have the same effect as cpufreq=none.

- With PowerNow enabled (C State Mode and HPC / CPB Mode disabled, no cpufreq cmdline) the system also seems stable.
- C State Mode set to C6 (HPC/CPB disabled, no cpufreq cmdline), seems stable.
- HPC enabled (CPB disabled, no cpufreq cmdline), crashed (serial console log attached as xen-crash-hpc-enabled).

These tests are inconclusive since I can't reliably trigger a panic, but it usually happened within a few minutes when compiling a kernel or under other load.
Even without load the system would crash, mind you; just idling it managed to crash as well, but that often took longer.

Anyhow, please let me know if there's anything else I can test for you guys.
And thanks for the help :)

Regards,

Wouter.
On Wed, 2013-11-06 at 15:44 -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Nov 06, 2013 at 02:25:28PM +0100, Wouter de Geus wrote:
> > * Ian Campbell <Ian.Campbell@citrix.com> [2013-11-06 10:51:07 +0000]:
> >
> > > > If this turns out to be stable I'll try again with cpufreq=dom0 to see if
> > > > that's also stable. I'll report my findings if you care.
> > >
> > > Please do.
> >
> > With cpufreq=none I've been able to run through a windows 2008 installation
> > and some kernel compiles without problems. After that I rebooted with
> > cpufreq=dom0, and within 5 minutes ran into the first oops again:
>
> Is there a particular reason you had tried 'cpufreq'? Sorry if that
> was answered earlier?

I suggested it in <1383730899.26213.16.camel@kazak.uk.xensource.com> (on xen-users only, you were CCd though) because one of the crashes was on the hypervisor side and involved do_dbs_timer, which looked (from the file comments) to be cpufreq related.

Ian.
>>> On 07.11.13 at 12:20, Wouter de Geus <benv-xensource.com@junerules.com> wrote:
> The BIOS has the following options that I consider relevant:
> Name [Current] (Options)
> - PowerNow [Enabled]
> - C State Mode [C6] (Disabled, C6)
> - PowerCap [P-state 0] (P-state 0 through 4)
> - HPC Mode [Enabled] (Disabled, Enabled)
> - CPB Mode [Auto] (Disabled, Auto)
> - C1E Support [Enabled] (Enabled, Disabled)
>
> When I set PowerNow to disabled the C State Mode, PowerCap and HPC options
> also disappear.
> After booting with PowerNow disabled (without the cpufreq option) I tried a
> kernel compile twice and some heavy I/O under which the system was stable.
> So that seems to have the same effect as cpufreq=none.
>
> - With PowerNow enabled (C State Mode and HPC / CPB Mode disabled, no cpufreq
>   cmdline) the system also seems stable.
> - C State Mode set to C6 (HPC/CPB disabled, no cpufreq cmdline), seems stable.
> - HPC enabled, (CPB disabled, no cpufreq cmdline), crashed (serial console
>   log attached as xen-crash-hpc-enabled).

Now we'd need to know what HPC actually means (it means nothing to me in this context) - I'd have expected PowerCap (as referring to P-states) to be the interesting one.

In any event - with cpufreq=dom0 and no cpufreq drivers loaded in dom0 (which as Konrad says should be the default), there shouldn't be any P-state management.

But you being able to suppress the problem with cpufreq=none also suggests that quite likely there's either a problem with the silicon, or the PowerNow driver in Xen has fallen so far out of date with respect to newer CPUs that it's not usable anymore (it certainly hasn't been touched in meaningful ways for quite a while).

You may have said so before, but can you confirm that under native Linux with acpi-cpufreq (or the powernow driver) loaded, you don't have this kind of problem? If so, could you please provide the contents of the respective sysfs nodes?

Jan
* Jan Beulich <JBeulich@suse.com> [2013-11-07 11:57:17 +0000]:
> > - PowerCap [P-state 0] (P-state 0 through 4)
>
> Now we'd need to know what HPC actually means (it means nothing
> to me in this context) - I'd have expected the PowerCap (as referring
> to P-states) to be the interesting one.

Would you like me to test the PowerCap setting? If so, in combination with the other settings set to what? Note that the PowerCap setting can't be disabled by itself (unless P-state 4 counts as disabled?).

According to a FAQ on supermicro.com (this is a Supermicro board after all)
http://www.supermicro.com/Aplus/support/faqs/faq.cfm?faq=13400
---
Q: I noticed that the newer BIOS supporting AMD 6200 series CPUs have a P-state HPC Mode option. Can you provide some info on this mode?
A: HPC mode only keeps maximum and minimum states. In system idle mode CPU will stay at P4 state for power saving. Once CPU detects higher activities, CPU will jump up to P0 or boost state to reduce clock ramp up latency.
---

> In any event - with cpufreq=dom0 and no cpufreq drivers loaded
> in dom0 (which as Konrad says should be the default), there
> shouldn't be any P-state management.

I thought cpufreq=xen was the default - at least according to http://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

> But you being able to suppress the problem with cpufreq=none
> also suggests that quite likely there's either a problem with the
> silicon, or the PowerNow driver in Xen went sufficiently much out
> of date wrt newer CPUs that it's not usable anymore (it certainly
> hasn't been touched in meaningful ways for quite a while). You
> may have said so before, but can you confirm that under native
> Linux with acpi-cpufreq (or the powernow driver) loaded, you
> don't have this kind of problem? If so, could you please provide
> contents of the respective sysfs nodes?

I started tinkering on this new machine with (Slackware's) linux 3.10.17 kernel and had no problems whatsoever.
The problems only started after booting Xen with my new custom 3.12 kernel.

I just booted the machine with the 3.12-dom0 kernel without Xen. And rebooted since I guess you're interested in the contents with HPC Mode enabled ;)

I've attached the output of dmesg to this email.
Not sure which sysfs nodes you're interested in though, there's:
/sys/module/acpi_cpufreq/parameters/acpi_pstate_strict (contents: 0)

Then per CPU we have /sys/devices/system/cpu/cpu0/cpufreq with (for CPU 0):
affected_cpus -> 0
bios_limit -> 2600000
cpb -> 1
cpuinfo_cur_freq -> 2600000
cpuinfo_max_freq -> 2600000
cpuinfo_min_freq -> 1400000
cpuinfo_transition_latency -> 5000
freqdomain_cpus -> 0 1
related_cpus -> 0
scaling_available_frequencies -> 2600000 1400000
scaling_available_governors -> conservative ondemand userspace powersave performance
scaling_cur_freq -> 2600000
scaling_driver -> acpi-cpufreq
scaling_governor -> performance
scaling_max_freq -> 2600000
scaling_min_freq -> 1400000
scaling_setspeed -> <unsupported>

If there's any other entry you would like to see, please let me know :)

Meanwhile the machine is still stable after going through several kernel compilations and some heavy I/O (just for testing).

Regards,

Wouter.
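
For anyone who needs to collect the same cpufreq sysfs nodes for every CPU instead of copying them by hand, here is a minimal sketch. It is not from the thread: it only assumes the /sys/devices/system/cpu/cpuN/cpufreq layout shown in the listing above, and the function name dump_cpufreq and its root parameter are made up for illustration.

```python
import glob
import os


def dump_cpufreq(root="/sys/devices/system/cpu"):
    """Return {cpu_name: {node_name: contents}} for every cpuN/cpufreq directory.

    'root' is a hypothetical parameter so the function can be pointed at a
    copy of the tree. Nodes that cannot be read (some need root) are
    recorded as "<unreadable>" instead of aborting the whole dump.
    """
    result = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cpufreq")
    for cpufreq_dir in sorted(glob.glob(pattern)):
        # The parent directory name is the CPU name, e.g. "cpu0".
        cpu = os.path.basename(os.path.dirname(cpufreq_dir))
        nodes = {}
        for name in sorted(os.listdir(cpufreq_dir)):
            path = os.path.join(cpufreq_dir, name)
            if not os.path.isfile(path):
                continue  # skip subdirectories such as stats/
            try:
                with open(path) as fh:
                    nodes[name] = fh.read().strip()
            except OSError:
                nodes[name] = "<unreadable>"
        result[cpu] = nodes
    return result


if __name__ == "__main__":
    # Print in the same "name -> value" style used in the mail above.
    for cpu, nodes in dump_cpufreq().items():
        print(cpu)
        for name, value in nodes.items():
            print(f"  {name} -> {value}")
```

Running it as root should also pick up nodes like cpuinfo_cur_freq that are often not world-readable; the output can then be attached to a report as-is.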
>>> On 07.11.13 at 14:10, Wouter de Geus <benv-xensource.com@junerules.com> wrote:
> * Jan Beulich <JBeulich@suse.com> [2013-11-07 11:57:17 +0000]:
>
>> > - PowerCap [P-state 0] (P-state 0 through 4)
>>
>> Now we'd need to know what HPC actually means (it means nothing
>> to me in this context) - I'd have expected the PowerCap (as referring
>> to P-states) to be the interesting one.
>
> Would you like me to test the PowerCap setting? If so, in combination with
> the other settings set to what? Note that the PowerCap setting can't be
> disabled by itself. (unless P-state 4 counts as disabled?)
>
> According to a faq on supermicro.com (this is a Supermicro board after all)
> http://www.supermicro.com/Aplus/support/faqs/faq.cfm?faq=13400
> ---
> Q: I noticed that the newer BIOS supporting AMD 6200 series CPUs have a
> P-state HPC Mode option. Can you provide some info on this mode?
> A: HPC mode only keeps maximum and minimum states. In system idle mode CPU
> will stay at P4 state for power saving. Once CPU detects higher
> activities, CPU will jump up to P0 or boost state to reduce clock ramp
> up latency.

That suggests that P4 is the lowest power state (and the only low power one in HPC mode), i.e. not meaning disabled. P0 alone would then mean disabled afaict. And in non-HPC mode I would conclude intermediate states are also allowed, in which case limiting the number of states might be interesting for you to try out.

Jan
* Jan Beulich <JBeulich@suse.com> [2013-11-07 13:15:26 +0000]:
> That suggests that P4 is the lowest power state (and the only low
> power one in HPC mode), i.e. not meaning disabled. P0 alone would
> then mean disabled afaict. And in non-HPC mode I would conclude
> intermediate states are also allowed, in which case limiting the number
> of states might be interesting for you to try out.

Well, non-HPC mode works for me, and I don't really see the advantage of HPC mode anyway. So as far as I'm concerned I'll leave it off.

So if there's any mode you want me to try, please be specific in what to test and what to report and I'll try it out :)

Regards,

Wouter.