On Thu, Feb 04, 2021 at 01:34:13PM -0800, Matthew Macy wrote:
> On Thu, Feb 4, 2021 at 1:31 PM Alan Somers <asomers at freebsd.org> wrote:
> >
> > After upgrading a machine to FreeBSD 12.2, it hit the following panic on
> > its first reboot.  I suspect that a few other servers have hit this too,
> > but since it happens before swap is mounted there are no core dumps, and
> > they usually reboot immediately.  The code in question hasn't changed since
> > 2018.  The panic happened in cmci_monitor at line 930.  Does anybody have
> > any suggestions for how I could debug further?  I can't readily reproduce
> > it, and I can't dump core, but I'd like to investigate it any way I can.
> > The server in question has dual Xeon Gold 6142 CPUs.
> >
>
> I can't actually help :( but I can add a +1 with similar hardware or
> equivalent specs. It's not frequent, but it's often enough to be
> annoying.
> -M
>
> >         if (!(ctl & MC_CTL2_CMCI_EN))
> >                 /* This bank does not support CMCI. */
> >                 return;
> >
> >         cc = &cmc_state[PCPU_GET(cpuid)][i]; // <- panic here
> >
> >         /* Determine maximum threshold. */
> >
> >
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 26; apic id = 34
> > fault virtual address   = 0xd0
> > fault code              = supervisor read data, page not present
> > instruction pointer     = 0x20:0xffffffff8125a009
> > stack pointer           = 0x28:0xfffffe0000b65f20
> > frame pointer           = 0x28:0xfffffe0000b65f50
> > code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                         = DPL 0, pres 1, long 1, def32 0, gran 1
> > processor eflags        = resume, IOPL = 0
> > current process         = 11 (idle: cpu26)
> > trap number             = 12
> > panic: page fault
> > cpuid = 26
> > time = 1
> > KDB: stack backtrace:
> > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000b65be0
> > vpanic() at vpanic+0x17b/frame 0xfffffe0000b65c30
> > panic() at panic+0x43/frame 0xfffffe0000b65c90
> > trap_fatal() at trap_fatal+0x391/frame 0xfffffe0000b65cf0
> > trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0000b65d40
> > trap() at trap+0x286/frame 0xfffffe0000b65e50
> > calltrap() at calltrap+0x8/frame 0xfffffe0000b65e50
> > --- trap 0xc, rip = 0xffffffff8125a009, rsp = 0xfffffe0000b65f20, rbp = 0xfffffe0000b65f50 ---
> > _mca_init() at _mca_init+0x5d9/frame 0xfffffe0000b65f50
> > init_secondary_tail() at init_secondary_tail+0xfd/frame 0xfffffe0000b65f80
> > init_secondary() at init_secondary+0x2d1/frame 0xfffffe0000b65ff0
> > KDB: enter: panic
> > [ thread pid 11 tid 100029 ]
> > Stopped at      kdb_enter+0x37: movq    $0,0x12bc1f6(%rip)

Try this.

I think that there are no other dependencies in the startup order, but I
cannot know it for sure.

commit 19584e3d3e9606d591fa30999b370ed758960e8c
Author: Konstantin Belousov <kib at FreeBSD.org>
Date:   Fri Feb 5 00:56:09 2021 +0200

    x86: init mca before APs are started

diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
index 03100e77d455..e2bf2673cf69 100644
--- a/sys/x86/x86/mca.c
+++ b/sys/x86/x86/mca.c
@@ -1371,7 +1371,7 @@ mca_init_bsp(void *arg __unused)
 
 	mca_init();
 }
-SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_ANY, mca_init_bsp, NULL);
+SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_SECOND, mca_init_bsp, NULL);
 
 /* Called when a machine check exception fires. */
 void
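
A note on the fault address in the trap quoted earlier in this message: 0xd0 is 208 decimal, which equals 26 * 8, and the panic was on cpuid 26. That would be consistent with cmc_state itself still being NULL when the AP indexed it, assuming cmc_state is an array of 8-byte per-CPU pointers on amd64 (an assumption; the thread does not show the declaration). A minimal user-space sketch of that arithmetic follows; fault_address() and matches_reported_fault() are hypothetical helpers, not kernel functions.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: the address touched when
 * reading slot `index` of an array of `elem_size`-byte slots that
 * starts at `base`.
 */
static uint64_t
fault_address(uint64_t base, uint64_t index, uint64_t elem_size)
{
	return (base + index * elem_size);
}

/*
 * Assumption: cmc_state is NULL (base 0) and each slot is an 8-byte
 * pointer.  Reading the slot for cpuid 26 then touches 26 * 8 = 0xd0,
 * the fault virtual address reported in the trap.
 */
static int
matches_reported_fault(void)
{
	return (fault_address(0, 26, 8) == 0xd0);
}
```

Under the same assumption, any other CPU hitting this path would fault at cpuid * 8, so the fault address in a console capture may be enough to tell whether a different panic is the same problem.
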

On Thu, Feb 4, 2021 at 3:58 PM Konstantin Belousov <kostikbel at gmail.com> wrote:
> On Thu, Feb 04, 2021 at 01:34:13PM -0800, Matthew Macy wrote:
> > On Thu, Feb 4, 2021 at 1:31 PM Alan Somers <asomers at freebsd.org> wrote:
> > >
> > > After upgrading a machine to FreeBSD 12.2, it hit the following panic on
> > > its first reboot.  I suspect that a few other servers have hit this too,
> > > but since it happens before swap is mounted there are no core dumps, and
> > > they usually reboot immediately.  The code in question hasn't changed since
> > > 2018.  The panic happened in cmci_monitor at line 930.  Does anybody have
> > > any suggestions for how I could debug further?  I can't readily reproduce
> > > it, and I can't dump core, but I'd like to investigate it any way I can.
> > > The server in question has dual Xeon Gold 6142 CPUs.
> > >
> >
> > I can't actually help :( but I can add a +1 with similar hardware or
> > equivalent specs. It's not frequent, but it's often enough to be
> > annoying.
> > -M
> >
> > >         if (!(ctl & MC_CTL2_CMCI_EN))
> > >                 /* This bank does not support CMCI. */
> > >                 return;
> > >
> > >         cc = &cmc_state[PCPU_GET(cpuid)][i]; // <- panic here
> > >
> > >         /* Determine maximum threshold. */
> > >
> > >
> > > Fatal trap 12: page fault while in kernel mode
> > > cpuid = 26; apic id = 34
> > > fault virtual address   = 0xd0
> > > fault code              = supervisor read data, page not present
> > > instruction pointer     = 0x20:0xffffffff8125a009
> > > stack pointer           = 0x28:0xfffffe0000b65f20
> > > frame pointer           = 0x28:0xfffffe0000b65f50
> > > code segment            = base 0x0, limit 0xfffff, type 0x1b
> > >                         = DPL 0, pres 1, long 1, def32 0, gran 1
> > > processor eflags        = resume, IOPL = 0
> > > current process         = 11 (idle: cpu26)
> > > trap number             = 12
> > > panic: page fault
> > > cpuid = 26
> > > time = 1
> > > KDB: stack backtrace:
> > > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000b65be0
> > > vpanic() at vpanic+0x17b/frame 0xfffffe0000b65c30
> > > panic() at panic+0x43/frame 0xfffffe0000b65c90
> > > trap_fatal() at trap_fatal+0x391/frame 0xfffffe0000b65cf0
> > > trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0000b65d40
> > > trap() at trap+0x286/frame 0xfffffe0000b65e50
> > > calltrap() at calltrap+0x8/frame 0xfffffe0000b65e50
> > > --- trap 0xc, rip = 0xffffffff8125a009, rsp = 0xfffffe0000b65f20, rbp = 0xfffffe0000b65f50 ---
> > > _mca_init() at _mca_init+0x5d9/frame 0xfffffe0000b65f50
> > > init_secondary_tail() at init_secondary_tail+0xfd/frame 0xfffffe0000b65f80
> > > init_secondary() at init_secondary+0x2d1/frame 0xfffffe0000b65ff0
> > > KDB: enter: panic
> > > [ thread pid 11 tid 100029 ]
> > > Stopped at      kdb_enter+0x37: movq    $0,0x12bc1f6(%rip)
>
> Try this.
>
> I think that there are no other dependencies in the startup order, but I
> cannot know it for sure.
>
> commit 19584e3d3e9606d591fa30999b370ed758960e8c
> Author: Konstantin Belousov <kib at FreeBSD.org>
> Date:   Fri Feb 5 00:56:09 2021 +0200
>
>     x86: init mca before APs are started
>
> diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
> index 03100e77d455..e2bf2673cf69 100644
> --- a/sys/x86/x86/mca.c
> +++ b/sys/x86/x86/mca.c
> @@ -1371,7 +1371,7 @@ mca_init_bsp(void *arg __unused)
>
>         mca_init();
>  }
> -SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_ANY, mca_init_bsp, NULL);
> +SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_SECOND, mca_init_bsp, NULL);
>
>  /* Called when a machine check exception fires. */
>  void
>

I can test this patch on development servers, but so far I've only seen the
crash on production servers.  Do you have any suggestions for how to force
the crash, or how to test this patch besides simply making sure that my dev
servers can boot?

-Alan

On Fri, Feb 05, 2021 at 12:58:34AM +0200, Konstantin Belousov wrote:
> On Thu, Feb 04, 2021 at 01:34:13PM -0800, Matthew Macy wrote:
> > On Thu, Feb 4, 2021 at 1:31 PM Alan Somers <asomers at freebsd.org> wrote:
> > >
> > > After upgrading a machine to FreeBSD 12.2, it hit the following panic on
> > > its first reboot.  I suspect that a few other servers have hit this too,
> > > but since it happens before swap is mounted there are no core dumps, and
> > > they usually reboot immediately.  The code in question hasn't changed since
> > > 2018.  The panic happened in cmci_monitor at line 930.  Does anybody have
> > > any suggestions for how I could debug further?  I can't readily reproduce
> > > it, and I can't dump core, but I'd like to investigate it any way I can.
> > > The server in question has dual Xeon Gold 6142 CPUs.
> > >
>
> Try this.
>
> I think that there are no other dependencies in the startup order, but I
> cannot know it for sure.
>
> commit 19584e3d3e9606d591fa30999b370ed758960e8c
> Author: Konstantin Belousov <kib at FreeBSD.org>
> Date:   Fri Feb 5 00:56:09 2021 +0200
>
>     x86: init mca before APs are started

APs only call mca_init() after they have been released by the BSP though,
and that happens later in SI_SUB_SMP.

> diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
> index 03100e77d455..e2bf2673cf69 100644
> --- a/sys/x86/x86/mca.c
> +++ b/sys/x86/x86/mca.c
> @@ -1371,7 +1371,7 @@ mca_init_bsp(void *arg __unused)
>
>         mca_init();
>  }
> -SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_ANY, mca_init_bsp, NULL);
> +SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_SECOND, mca_init_bsp, NULL);
>
>  /* Called when a machine check exception fires. */
>  void
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"
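
The ordering point in this reply can be sketched in user space. SYSINIT entries are sorted by subsystem first and by order within a subsystem, so changing mca_init_bsp from SI_ORDER_ANY to SI_ORDER_SECOND only moves it within SI_SUB_CPU; either way it sorts ahead of anything registered in a later subsystem such as SI_SUB_SMP. The model below is only a rough illustration of that sort: the numeric values are placeholders rather than the real constants from sys/kernel.h, and every entry name other than mca_init_bsp is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Placeholder values for illustration only; not the sys/kernel.h constants. */
enum { SUB_CPU = 1, SUB_SMP = 2 };
enum { ORDER_FIRST = 0, ORDER_SECOND = 1, ORDER_ANY = 0x7fffffff };

struct init_entry {
	const char *name;
	int subsystem;
	int order;
};

/* Sort key: subsystem first, then order within the subsystem. */
static int
init_entry_cmp(const void *a, const void *b)
{
	const struct init_entry *x = a, *y = b;

	if (x->subsystem != y->subsystem)
		return (x->subsystem < y->subsystem ? -1 : 1);
	if (x->order != y->order)
		return (x->order < y->order ? -1 : 1);
	return (0);
}

/* Index of `name` in the table, or -1 if it is not present. */
static int
find_entry(const struct init_entry *tab, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(tab[i].name, name) == 0)
			return ((int)i);
	return (-1);
}

/*
 * Returns 1 when mca_init_bsp, registered with `mca_order`, sorts
 * before the (hypothetical) entry that releases the APs.
 */
static int
bsp_mca_before_ap_release(int mca_order)
{
	struct init_entry tab[] = {
		{ "release_aps", SUB_SMP, ORDER_FIRST },	/* hypothetical */
		{ "mca_init_bsp", SUB_CPU, mca_order },
		{ "other_cpu_init", SUB_CPU, ORDER_FIRST },	/* hypothetical */
	};
	size_t n = sizeof(tab) / sizeof(tab[0]);

	qsort(tab, n, sizeof(tab[0]), init_entry_cmp);
	return (find_entry(tab, n, "mca_init_bsp") <
	    find_entry(tab, n, "release_aps"));
}
```

In this model both bsp_mca_before_ap_release(ORDER_ANY) and bsp_mca_before_ap_release(ORDER_SECOND) return 1, which matches the objection above: if the APs really are released only in the later subsystem, the reordering alone would not change what they see.
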