PJ Welsh
2017-Apr-18 13:36 UTC
[CentOS-virt] Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.
There was a note that the non-Xen kernel at the same kernel version did indeed boot:
"CentOS-6 4.9.20-26 kernel exhibits the same constant kernel-start-then-reboot issue when booting under the "CentOS Linux, with Xen hypervisor" grub2 menu option. However, it *does* properly boot under the "CentOS Linux (4.9.20-25.el7.x86_64) 7 (Core)" grub2 menu option!"

Trying to get back into being able to test this more.

Thanks
PJ

On Tue, Apr 18, 2017 at 8:30 AM, Johnny Hughes <johnny at centos.org> wrote:

> On 04/14/2017 03:26 PM, Anderson, Dave wrote:
> > Sad to say that I already tested 4.9.20-26 from your repo yesterday...it does look a little cleaner before it dies, but still dies. I have not tested it with the vcpu=4 workaround, but I can tonight if you would like. Relevant bits below:
> >
> > Loading Xen 4.6.3-12.el7 ...
> > Loading Linux 4.9.20-26.el7.x86_64 ...
> > Loading initial ramdisk ...
> > [ 0.000000] Linux version 4.9.20-26.el7.x86_64 (mockbuild@) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Tue Apr 4 11:19:26 CDT 2017
> >
> > <snip>
> >
> > [ 6.195089] smpboot: Max logical packages: 1
> > [ 6.199549] VPMU disabled by hypervisor.
> > [ 6.203663] Performance Events: SandyBridge events, PMU not available due to virtualization, using software events only.
> > [ 6.215436] NMI watchdog: disabled (cpu0): hardware events not enabled
> > [ 6.222139] NMI watchdog: Shutting down hard lockup detector on all cpus
> > [ 6.229165] installing Xen timer for CPU 1
> > [ 6.233849] installing Xen timer for CPU 2
> > [ 6.238504] installing Xen timer for CPU 3
> > [ 6.243139] installing Xen timer for CPU 4
> > [ 6.247836] installing Xen timer for CPU 5
> > [ 6.252478] installing Xen timer for CPU 6
> > [ 6.257155] installing Xen timer for CPU 7
> > [ 6.261795] installing Xen timer for CPU 8
> > [ 6.266358] smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
> > [ 6.272736] ------------[ cut here ]------------
> > [ 6.277358] kernel BUG at arch/x86/kernel/cpu/common.c:997!
> > [ 6.280104] random: fast init done
> > [ 6.286333] invalid opcode: 0000 [#1] SMP
> > [ 6.290343] Modules linked in:
> > [ 6.293430] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.20-26.el7.x86_64 #1
> > [ 6.300568] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
> > [ 6.307183] task: ffff880058a68000 task.stack: ffffc900400c0000
> > [ 6.313103] RIP: e030:[<ffffffff8103e7e7>] [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
> > [ 6.322019] RSP: e02b:ffffc900400c3f08 EFLAGS: 00010086
> > [ 6.327333] RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81e5ffc8
> > [ 6.334473] RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
> > [ 6.341607] RBP: ffffc900400c3f18 R08: 00000000000000ce R09: 0000000000000000
> > [ 6.348738] R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000000008
> > [ 6.355873] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > [ 6.363006] FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
> > [ 6.371090] CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
> > [ 6.376837] CR2: 0000000000000000 CR3: 0000000001e07000 CR4: 0000000000042660
> > [ 6.383970] Stack:
> > [ 6.386004] 0000000000000008 0000000000000000 ffffc900400c3f28 ffffffff8104ebce
> > [ 6.393483] ffffc900400c3f40 ffffffff81029855 0000000000000000 ffffc900400c3f50
> > [ 6.400963] ffffffff810298d0 0000000000000000 0000000000000000 0000000000000000
> > [ 6.408450] Call Trace:
> > [ 6.410907] [<ffffffff8104ebce>] smp_store_cpu_info+0x3e/0x40
> > [ 6.416753] [<ffffffff81029855>] cpu_bringup+0x35/0x90
> > [ 6.421981] [<ffffffff810298d0>] cpu_bringup_and_idle+0x20/0x40
> > [ 6.427987] Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 e4 02 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 e8 ce ca 81
> > [ 6.448249] RIP [<ffffffff8103e7e7>] identify_secondary_cpu+0x57/0x80
> > [ 6.454801] RSP <ffffc900400c3f08>
> > [ 6.458305] ---[ end trace 2f9b62c5c7050204 ]---
> >
> > So basically, it removes the "[Firmware Bug]: CPU1: APIC id mismatch. Firmware: 0 APIC: 1" lines, but otherwise dies the same way. I included a few extra lines up from the panic because the "[ 6.195089] smpboot: Max logical packages: 1" could possibly be relevant, I need to go look at a clean boot to see if that was in there on this machine.
> >
> > Even more strangely, in addition to the machine I'm talking about which panics and reboots, I had a second nearly identical machine (different CPU/ram config, everything else the same) which booted but had some kind of hw conflict with 4.9.x that I never had before. It appears to be between Intel SCU and an Intel PCIe NVMe SSD (luckily I wasn't using SCU, so I disabled that). Had that other machine not booted I would have just assumed 4.9.X was totally broken and sat on 3.18...so I'm glad that one machine booted at least :)
> >
> > Thanks,
> > -Dave
>
> Dave,
>
> Just for testing purposes, can you try booting the kernel in the normal way on the machine that does not work (a normal grub entry on the kernel with no xen.gz line)
>
> That way, we can hopefully narrow the issue down to a hypervisor issue or a kernel config issue.
>
> Thanks,
> Johnny Hughes
>
> >> On Apr 14, 2017, at 05:39, Johnny Hughes <johnny at centos.org> wrote:
> >>
> >> Dave,
> >>
> >> Take a look at this kernel as it is the one I think we are going to release (or a slightly newer 4.9.2x from kernel.org LTS). This version has some newer settings that are more redhat/fedora/centos base kernel like WRT what is a module and what is built into the kernel, etc.
> >>
> >> https://people.centos.org/hughesjr/4.9.x/
> >>
> >> Thanks,
> >> Johnny Hughes
> >>
> >> On 04/14/2017 05:16 AM, Anderson, Dave wrote:
> >>> List moderator: feel free to delete my previous large message with attachments that's in the moderation queue...it's now obsolete anyway.
> >>>
> >>> I have found a fix/workaround for my reboot issues with Xen 4.6.3-12 + Kernel 4.9.13:
> >>>
> >>> Once I finally got serial output all the way through the boot process (xen+dom0) I discovered the stack trace:
> >>>
> >>> [Firmware Bug]: CPU7: APIC id mismatch. Firmware: 0 APIC: 7
> >>> installing Xen timer for CPU 8
> >>> [Firmware Bug]: CPU8: APIC id mismatch. Firmware: 0 APIC: 20
> >>> smpboot: Package 1 of CPU 8 exceeds BIOS package data 1.
> >>> ------------[ cut here ]------------
> >>> kernel BUG at arch/x86/kernel/cpu/common.c:997!
> >>> invalid opcode: 0000 [#1] SMP
> >>> Modules linked in:
> >>> CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.9.13-22.el7.x86_64 #1
> >>> Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.2a 08/04/2015
> >>> random: fast init done
> >>> task: ffff880058a8c4c0 task.stack: ffffc900400b4000
> >>> RIP: e030:[<ffffffff8103e527>] [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
> >>> RSP: e02b:ffffc900400b7f08 EFLAGS: 00010086
> >>> RAX: 00000000ffffffe4 RBX: ffff88005d80a020 RCX: ffffffff81c5be68
> >>> RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000000000000005
> >>> RBP: ffffc900400b7f18 R08: 00000000000000cb R09: 0000000000000004
> >>> R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000008
> >>> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> >>> FS: 0000000000000000(0000) GS:ffff88005d800000(0000) knlGS:0000000000000000
> >>> CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
> >>> CR2: 0000000000000000 CR3: 0000000001c07000 CR4: 0000000000042660
> >>> Stack:
> >>> 0000000000000008 0000000000000000 ffffc900400b7f28 ffffffff8104e94e
> >>> ffffc900400b7f40 ffffffff81029925 0000000000000000 ffffc900400b7f50
> >>> ffffffff810299a0 0000000000000000 0000000000000000 0000000000000000
> >>> Call Trace:
> >>> [<ffffffff8104e94e>] smp_store_cpu_info+0x3e/0x40
> >>> [<ffffffff81029925>] cpu_bringup+0x35/0x90
> >>> [<ffffffff810299a0>] cpu_bringup_and_idle+0x20/0x40
> >>> Code: 44 89 e7 ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 1c 0f b7 bb da 00 00 00 44 89 e6 e8 24 03 01 00 85 c0 75 07 5b 41 5c 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 44 89 e6 48 c7 c7 98 87 a6 81
> >>> RIP [<ffffffff8103e527>] identify_secondary_cpu+0x57/0x80
> >>> RSP <ffffc900400b7f08>
> >>> ---[ end trace dc5563100443876e ]---
> >>>
> >>> I surmised that reducing the number of dom0 vcpu might solve this issue (they were unbounded)
> >>>
> >>> In testing, adding "dom0_max_vcpus=4 dom0_vcpus_pin" to the GRUB_CMDLINE_XEN_DEFAULT line in /etc/default/grub and re-running grub2-mkconfig has resulted in the system I have that never booted Xen 4.6.3-12 + Kernel 4.9.13, booting every single time out of 5-10 tests.
> >>>
> >>> So...I don't know if there's a race condition somewhere, or what...but...so far this workaround has not failed me.
> >>>
> >>> Thanks,
> >>> -Dave
> >>>
> >>>> On Fri, Apr 7, 2017 at 6:58 AM, PJ Welsh <pjwelsh at gmail.com> wrote:
> >>>>> I've not gotten any bites from my posting on the xen-devel mailing list.
> >>>>> Here is the only one to-date:
> >>>>> https://lists.xen.org/archives/html/xen-devel/2017-04/msg01069.html
> >>>>>
> >>>>> From that email, there needs to be some hypervisor messages.
> >>>>>
> >>>>> Does anyone know how to produce the hypervisor messages? I've already removed the rhgb and quiet options from the boot.
> >>>>>
> >>>>> Thanks
> >>>>> PJ
> >>>>
> >>>> I spoke too soon. To get more information: Please see
> >>>>
> >>>> https://wiki.xenproject.org/wiki/Reporting_Bugs_against_Xen_Project
> >>>>
> >>>> and
> >>>>
> >>>> https://wiki.xenproject.org/wiki/Xen_Serial_Console
> >>>>
> >>>> or alternatively at least add "vga=keep".
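
For readers who want to try the dom0 vcpu workaround quoted above, the change might look like the fragment below. This is a minimal sketch, not Dave's actual file: the two option names come from his message, while the excerpt layout and the grub2-mkconfig output path are assumptions (the thread does not show either), and any Xen options already present on that line would need to be kept alongside the new ones.

```shell
# /etc/default/grub (excerpt; assumes no other Xen options were set on this line)
GRUB_CMDLINE_XEN_DEFAULT="dom0_max_vcpus=4 dom0_vcpus_pin"

# Afterwards, regenerate the grub2 configuration as root. The output path below
# is an assumption for a BIOS-booted CentOS 7 host and is not given in the thread:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

The options only take effect on the next boot through the "with Xen hypervisor" menu entry, since they are passed to Xen rather than to the Linux kernel.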
PJ Welsh
2017-Apr-18 13:39 UTC
[CentOS-virt] Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.
Just to note, the same pattern happens on C7:
"CentOS Linux, with Xen hypervisor" = reboot
"CentOS Linux (4.9.20-26.el7.x86_64) 7 (Core)" = boot

[root at XXX ~]# uname -a
Linux XXX 4.9.20-25.el7.x86_64 #1 SMP Fri Mar 31 08:53:28 CDT 2017 x86_64 x86_64 x86_64
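
When comparing grub2 menu entries like this, it can help to record which kernel actually booted and whether it is running on a hypervisor. The following is a rough sketch under assumptions: the function name boot_summary is made up for illustration, and it relies on the kernel exposing /sys/hypervisor/type, which a Xen dom0 normally does but which may be absent on other setups.

```shell
#!/bin/sh
# boot_summary (hypothetical helper): print the running kernel release and,
# when the kernel exposes it, the hypervisor type from sysfs.
boot_summary() {
    printf 'kernel: %s\n' "$(uname -r)"
    if [ -r /sys/hypervisor/type ]; then
        # Typically contains "xen" when booted under the Xen hypervisor.
        printf 'hypervisor: %s\n' "$(cat /sys/hypervisor/type)"
    else
        printf 'hypervisor: none reported\n'
    fi
}

boot_summary
```

Running it after each test boot gives a two-line record that can be pasted into a report next to the menu entry that was selected.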
PJ Welsh
2017-Apr-18 13:44 UTC
[CentOS-virt] Xen C6 kernel 4.9.13 and testing 4.9.15 only reboots.
Apologies: I installed the newer -26 kernel and had not rebooted into it. The grub2 menu item should have been "CentOS Linux (4.9.20-25.el7.x86_64) 7 (Core)". I am currently restarting that remote affected system (unmodified grub2 entry first).

Thanks
PJ