thr3ads.net - freebsd stable - Page fault in _mca

Konstantin Belousov

2021-Feb-05 17:21 UTC

Page fault in _mca_init during startup

On Fri, Feb 05, 2021 at 09:01:26AM -0700, Alan Somers wrote:
> On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov <kostikbel at gmail.com>
> wrote:
>
> > On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel at gmail.com>
> > > wrote:
> > >
> > > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel at gmail.com>
> > > > > wrote:
> > > > > > Do you have INVARIANTS enabled?  If not, I am curious if enabling
> > > > > > them would convert that rare page fault into rare
> > > > > > "CPU %d has more MC banks" assert.
> > > > > >
> > > > > > Also might be the output of the
> > > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > > >
> > > > >
> > > > > I don't have INVARIANTS enabled, and I can't enable it on the
> > > > > production servers.  However, I can turn those three KASSERTs into
> > > > > VERIFYs and see what happens.  Here is what your command shows on the
> > > > > server that panicked:
> > > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > > >   16 MSR 0x179: 0x00000000 0x0f000c14
> > > > >   16 MSR 0x179: 0x00000000 0x0f000814
> > > >
> > > > It probably explains it, but it would be more telling if you left the
> > > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit
> > > > set.
> > > >
> > >
> > > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > > don't.
> > >
> > >
> > > >
> > > > I suspect that your machine has two sockets, and processor in one
> > > > socket has CPUs reporting MCG_CMCI_P, while other processor does
> > > > not. Your SMP is not quite symmetric, perhaps processors were from
> > > > different bins?
> >
>
> I found 2 other servers that exhibit the same problem: the first 16 cores
> have bit 10 set and the second 16 don't.  All 3 have dual Xeon Gold 6142
> CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12.  I have
> other examples of X11DPU motherboards that don't exhibit the problem, but
> they all have both different CPUs and different BIOS revisions.  So I can't
> be sure whether the bug follows the CPU model or the BIOS version.

I looked at the full spec update errata list for the first gen Skylake
Xeons, but did not notice anything relevant. The EDS doc does not provide
much useful info on MSR 0x179 bit 10 either, beyond rewording the SDM
definition.
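
For readers following along, the two MCG_CAP values reported above can be decoded by hand. The sketch below is not from the thread. It assumes the field layout given in the Intel SDM (bank count in bits 7:0, MCG_CMCI_P in bit 10), and the function name mcg_cap_decode is made up for illustration.

```sh
#!/bin/sh
# Decode two MCG_CAP (MSR 0x179) fields from the low 32-bit word that
# cpucontrol(8) prints as the last value on each line.
# Assumed layout (Intel SDM): bits 7:0 = bank count, bit 10 = MCG_CMCI_P.
# mcg_cap_decode is a hypothetical helper name, not an existing tool.

mcg_cap_decode() {
    banks=$(( $1 & 0xff ))        # bits 7:0: number of MC banks
    cmci=$(( ($1 >> 10) & 1 ))    # bit 10: MCG_CMCI_P
    echo "banks=$banks cmci_p=$cmci"
}

# The two values seen on the server that panicked.
mcg_cap_decode 0x0f000c14
mcg_cap_decode 0x0f000814
```

Both values report the same bank count and differ only in cmci_p, which matches the observation that the first 16 CPUs have bit 10 set and the second 16 do not.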

In fact, I am not sure, but this bit might be writable by software. Try
to flip the bit with cpucontrol(8). It might be a BIOS bug after all.

If you have a contact at Intel or Supermicro, try to engage them.  I do
not have any further ideas, since the spec update does not mention the
problem.
> 
> 
> > > >
> > >
> > > Could be.  Is there some MSR that reports a more specific version
> > > number?
> > There are CPUID %eax=1 values returned in %eax, but then it requires
> > some interpretation.
> >         # cpucontrol -i 1 /dev/cpuctl$x
> > for $x iterating over the cpus.
> >
>
> Apart from the Local APIC ID field, that returns the same value for all
> processors.
>
> Your second patch doesn't cause any obvious problems on my dev system.

I hope that you would confirm that the issue is solved by it, after some
time.
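
As a side note on the "some interpretation" that the CPUID leaf 1 %eax value needs: a minimal sketch of the decoding, assuming the standard bit layout from the Intel SDM (stepping in bits 3:0, model in 7:4, family in 11:8, extended model in 19:16, extended family in 27:20). The helper name cpuid_sig_decode and the sample value are assumptions for illustration. The sample is not output from the affected servers.

```sh
#!/bin/sh
# Split a CPUID leaf 1 %eax signature into family, model, and stepping.
# cpuid_sig_decode is a hypothetical helper name.

cpuid_sig_decode() {
    stepping=$(( $1 & 0xf ))
    model=$(( ($1 >> 4) & 0xf ))
    family=$(( ($1 >> 8) & 0xf ))
    ext_model=$(( ($1 >> 16) & 0xf ))
    ext_family=$(( ($1 >> 20) & 0xff ))

    # The extended model field only counts when the base family is 6 or 15.
    if [ "$family" -eq 6 ] || [ "$family" -eq 15 ]; then
        model=$(( (ext_model << 4) + model ))
    fi
    # The extended family field only counts when the base family is 15.
    if [ "$family" -eq 15 ]; then
        family=$(( family + ext_family ))
    fi

    printf 'family=0x%x model=0x%x stepping=%d\n' "$family" "$model" "$stepping"
}

# Illustrative value only (assumed, not taken from the thread).
cpuid_sig_decode 0x00050654
```

Two processors from different steppings would differ in the stepping field here. In this thread the signature was reported identical on all CPUs, so it did not help to tell the two sockets apart.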

Alan Somers

2021-Feb-07 21:33 UTC

Page fault in _mca_init during startup

On Fri, Feb 5, 2021 at 10:21 AM Konstantin Belousov <kostikbel at gmail.com>
wrote:
> On Fri, Feb 05, 2021 at 09:01:26AM -0700, Alan Somers wrote:
> > On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov <kostikbel at gmail.com>
> > wrote:
> >
> > > On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > > > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel at gmail.com>
> > > > wrote:
> > > >
> > > > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel at gmail.com>
> > > > > > wrote:
> > > > > > > Do you have INVARIANTS enabled?  If not, I am curious if enabling
> > > > > > > them would convert that rare page fault into rare
> > > > > > > "CPU %d has more MC banks" assert.
> > > > > > >
> > > > > > > Also might be the output of the
> > > > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > > > >
> > > > > >
> > > > > > I don't have INVARIANTS enabled, and I can't enable it on the
> > > > > > production servers.  However, I can turn those three KASSERTs into
> > > > > > VERIFYs and see what happens.  Here is what your command shows on
> > > > > > the server that panicked:
> > > > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > > > >   16 MSR 0x179: 0x00000000 0x0f000c14
> > > > > >   16 MSR 0x179: 0x00000000 0x0f000814
> > > > >
> > > > > It probably explains it, but it would be more telling if you left the
> > > > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit
> > > > > set.
> > > > >
> > > >
> > > > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > > > don't.
> > > >
> > > >
> > > > >
> > > > > I suspect that your machine has two sockets, and processor in one
> > > > > socket has CPUs reporting MCG_CMCI_P, while other processor does
> > > > > not. Your SMP is not quite symmetric, perhaps processors were from
> > > > > different bins?
> > >
> >
> > I found 2 other servers that exhibit the same problem: the first 16 cores
> > have bit 10 set and the second 16 don't.  All 3 have dual Xeon Gold 6142
> > CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12.  I have
> > other examples of X11DPU motherboards that don't exhibit the problem, but
> > they all have both different CPUs and different BIOS revisions.  So I can't
> > be sure whether the bug follows the CPU model or the BIOS version.
> I looked at the full spec update errata list for the first gen Skylake
> Xeons, but did not notice anything relevant. The EDS doc does not provide
> much useful info on MSR 0x179 bit 10 either, beyond rewording the SDM
> definition.
>
> In fact, I am not sure, but this bit might be writable by software. Try
> to flip the bit with cpucontrol(8). It might be a BIOS bug after all.
>
> If you have a contact at Intel or Supermicro, try to engage them.  I do
> not have any further ideas, since the spec update does not mention the
> problem.
>
> >
> >
> > > > >
> > > >
> > > > Could be.  Is there some MSR that reports a more specific version
> > > > number?
> > > There are CPUID %eax=1 values returned in %eax, but then it requires
> > > some interpretation.
> > >         # cpucontrol -i 1 /dev/cpuctl$x
> > > for $x iterating over the cpus.
> > >
> >
> > Apart from the Local APIC ID field, that returns the same value for all
> > processors.
> >
> > Your second patch doesn't cause any obvious problems on my dev system.
> I hope that you would confirm that the issue is solved by it, after some
> time.
>
Upgrading the BIOS fixed the problem, by clearing the MCG_CMCI_P bit on all
processors.  I don't have strong opinions about whether we should commit
kib's patch too.  Kib, what do you think?
-Alan
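
For anyone who wants to check their own dual-socket machines for the same asymmetry, before or after a BIOS upgrade, here is a minimal sketch. It is not part of the thread. It assumes cpucontrol(8) prints one "MSR 0x179: <high> <low>" line per CPU, as in the output quoted above, and the function name check_cmci_uniform is made up for illustration.

```sh
#!/bin/sh
# Read "MSR 0x179: <hi> <lo>" lines on stdin (one per CPU) and report
# whether MCG_CMCI_P (bit 10 of the low word) agrees across all CPUs.
# check_cmci_uniform is a hypothetical helper name.

check_cmci_uniform() {
    with=0
    without=0
    while read -r _label _addr _hi lo; do
        [ -n "$lo" ] || continue          # skip blank or malformed lines
        if [ $(( ($lo >> 10) & 1 )) -eq 1 ]; then
            with=$(( with + 1 ))
        else
            without=$(( without + 1 ))
        fi
    done
    if [ "$with" -gt 0 ] && [ "$without" -gt 0 ]; then
        echo "MISMATCH: $with CPUs with MCG_CMCI_P, $without without"
        return 1
    fi
    echo "OK: MCG_CMCI_P is uniform across $(( with + without )) CPUs"
}

# Intended use on FreeBSD, as root, with cpuctl(4) loaded:
#   for x in $(jot $(sysctl -n hw.ncpu) 0); do
#       cpucontrol -m 0x179 /dev/cpuctl$x
#   done | check_cmci_uniform
```

The function returns non-zero on a mismatch, so it can be dropped into a periodic health check. It only looks at bit 10 and does not compare the bank counts.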
