On Thu, Feb 4, 2021 at 4:27 PM Mark Johnston <markj at freebsd.org> wrote:
> On Fri, Feb 05, 2021 at 12:58:34AM +0200, Konstantin Belousov wrote:
> > On Thu, Feb 04, 2021 at 01:34:13PM -0800, Matthew Macy wrote:
> > > On Thu, Feb 4, 2021 at 1:31 PM Alan Somers <asomers at freebsd.org> wrote:
> > > >
> > > > After upgrading a machine to FreeBSD 12.2, it hit the following panic on
> > > > its first reboot. I suspect that a few other servers have hit this too,
> > > > but since it happens before swap is mounted there are no core dumps, and
> > > > they usually reboot immediately. The code in question hasn't changed since
> > > > 2018. The panic happened in cmci_monitor at line 930. Does anybody have
> > > > any suggestions for how I could debug further? I can't readily reproduce
> > > > it, and I can't dump core, but I'd like to investigate it any way I can.
> > > > The server in question has dual Xeon Gold 6142 CPUs.
> > > >
> > Try this.
> >
> > I think that there are no other dependencies in the startup order, but
> > I cannot know that for sure.
> >
> > commit 19584e3d3e9606d591fa30999b370ed758960e8c
> > Author: Konstantin Belousov <kib at FreeBSD.org>
> > Date: Fri Feb 5 00:56:09 2021 +0200
> >
> > x86: init mca before APs are started
>
> APs only call mca_init() after they have been released by the BSP
> though, and that happens later in SI_SUB_SMP.
>
> > diff --git a/sys/x86/x86/mca.c b/sys/x86/x86/mca.c
> > index 03100e77d455..e2bf2673cf69 100644
> > --- a/sys/x86/x86/mca.c
> > +++ b/sys/x86/x86/mca.c
> > @@ -1371,7 +1371,7 @@ mca_init_bsp(void *arg __unused)
> >
> > mca_init();
> > }
> > -SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_ANY, mca_init_bsp, NULL);
> > +SYSINIT(mca_init_bsp, SI_SUB_CPU, SI_ORDER_SECOND, mca_init_bsp, NULL);
> >
> > /* Called when a machine check exception fires. */
> > void
>
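
To make the ordering question concrete, here is a small userland sketch
of how SYSINIT-style entries are sorted: first by subsystem, then by
order within the subsystem. It is not kernel code. The numeric values
are stand-ins that only preserve relative order, and "hypothetical_dep"
and "release_aps" are made-up names for "some initializer mca_init_bsp
might depend on" and "the point in SI_SUB_SMP where the APs are
released".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in values (assumption): only their relative order matters. */
enum { SUB_CPU = 1, SUB_SMP = 2 };
enum { ORDER_FIRST = 0, ORDER_SECOND = 1, ORDER_MIDDLE = 500, ORDER_ANY = 1000 };

struct sysinit_entry {
	const char *name;	/* label for the initializer */
	int subsystem;		/* primary sort key */
	int order;		/* secondary sort key within a subsystem */
};

/* Compare by subsystem first, then by order within the subsystem. */
static int
sysinit_cmp(const void *a, const void *b)
{
	const struct sysinit_entry *x = a, *y = b;

	if (x->subsystem != y->subsystem)
		return (x->subsystem < y->subsystem ? -1 : 1);
	if (x->order != y->order)
		return (x->order < y->order ? -1 : 1);
	return (0);
}

/* Sort the table and return the position at which `name` would run. */
static int
run_position(struct sysinit_entry *tab, size_t n, const char *name)
{
	assert(tab != NULL && n > 0);
	qsort(tab, n, sizeof(tab[0]), sysinit_cmp);
	for (size_t i = 0; i < n; i++)
		if (strcmp(tab[i].name, name) == 0)
			return ((int)i);
	return (-1);
}
```

In this model, moving mca_init_bsp from the "any" order to the "second"
order never moves it relative to anything in the SMP subsystem, but it
can move it ahead of another CPU-subsystem initializer that used to run
first.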
kib's patch causes a different problem, and this one is reproducible:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8125762c
stack pointer = 0x28:0xffffffff828dad90
frame pointer = 0x28:0xffffffff828dad90
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 0 ()
trap number = 12
panic: page fault
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff828daa50
vpanic() at vpanic+0x17b/frame 0xffffffff828daaa0
panic() at panic+0x43/frame 0xffffffff828dab00
trap_fatal() at trap_fatal+0x391/frame 0xffffffff828dab60
trap_pfault() at trap_pfault+0x4f/frame 0xffffffff828dabb0
trap() at trap+0x286/frame 0xffffffff828dacc0
calltrap() at calltrap+0x8/frame 0xffffffff828dacc0
--- trap 0xc, rip = 0xffffffff8125762c, rsp = 0xffffffff828dad90, rbp = 0xffffffff828dad90 ---
native_lapic_enable_cmc() at native_lapic_enable_cmc+0x1c/frame 0xffffffff828dad90
_mca_init() at _mca_init+0x94c/frame 0xffffffff828dadd0
mi_startup() at mi_startup+0xdf/frame 0xffffffff828dadf0
btext() at btext+0x2c
KDB: enter: panic
[ thread pid 0 tid 0 ]
Stopped at kdb_enter+0x37: movq $0,0x12bc396(%rip)
If you're wondering, the panic happens at this point in
native_lapic_enable_cmc:
        apic_id = PCPU_GET(apic_id);
        KASSERT(lapics[apic_id].la_present,
            ("%s: missing APIC %u", __func__, apic_id));
        lapics[apic_id].la_lvts[APIC_LVT_CMCI].lvt_masked = 0; <- panic here
        lapics[apic_id].la_lvts[APIC_LVT_CMCI].lvt_active = 1;
        if (bootverbose)
                printf("lapic%u: CMCI unmasked\n", apic_id);
}
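
One possible reading of the tiny fault address (0x18, with apic id 00),
which nothing in this thread confirms, is that the lapics pointer itself
is still NULL at this point, so the store lands at a small field offset
from address zero. The userland sketch below only illustrates that
arithmetic. The struct layout and FAKE_LVT_CMCI are hypothetical and are
not the kernel's real definitions, so the offset it computes is not
meant to match 0x18.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_LVT_COUNT	8
#define FAKE_LVT_CMCI	6	/* hypothetical index, not the kernel's */

/* Hypothetical layout (assumption), NOT the kernel's struct lapic. */
struct fake_lvt {
	uint32_t flags;
	uint32_t vector;
};

struct fake_lapic {
	uint32_t la_id;
	uint32_t la_present;
	struct fake_lvt la_lvts[FAKE_LVT_COUNT];
};

/*
 * Address touched by an access to base[apic_id].la_lvts[lvt_index],
 * computed with plain integer arithmetic so that a zero base can be
 * modeled without dereferencing anything.
 */
static uintptr_t
fault_address(uintptr_t base, unsigned apic_id, unsigned lvt_index)
{
	assert(lvt_index < FAKE_LVT_COUNT);
	return (base +
	    (uintptr_t)apic_id * sizeof(struct fake_lapic) +
	    offsetof(struct fake_lapic, la_lvts) +
	    (uintptr_t)lvt_index * sizeof(struct fake_lvt));
}
```

With a zero base and apic_id 0, fault_address() returns just the field
offset, which is well inside the first page. With a valid base the
result would be a normal kernel address instead.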