On Fri, Feb 5, 2021 at 10:21 AM Konstantin Belousov <kostikbel at gmail.com>
wrote:
> On Fri, Feb 05, 2021 at 09:01:26AM -0700, Alan Somers wrote:
> > On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov <kostikbel at gmail.com>
> > wrote:
> >
> > > On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > > > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel at gmail.com>
> > > > wrote:
> > > >
> > > > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel at gmail.com>
> > > > > > wrote:
> > > > > > > Do you have INVARIANTS enabled? If not, I am curious if enabling them
> > > > > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > > > > assert.
> > > > > > >
> > > > > > > Also might be the output of the
> > > > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > > > >
> > > > > >
> > > > > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > > > > servers. However, I can turn those three KASSERTs into VERIFYs and see
> > > > > > what happens. Here is what your command shows on the server that panicked:
> > > > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > > > > 16 MSR 0x179: 0x00000000 0x0f000c14
> > > > > > 16 MSR 0x179: 0x00000000 0x0f000814
> > > > >
> > > > > It probably explains it, but it would be more telling if you left the
> > > > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
> > > > >
> > > >
> > > > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > > > don't.
> > > >
> > > >
> > > > >
> > > > > I suspect that your machine has two sockets, and processor in one socket
> > > > > has CPUs reporting MCG_CMCI_P, while other processor does not. Your SMP
> > > > > is not quite symmetric, perhaps processors were from different bins?
> > >
> >
> > I found 2 other servers that exhibit the same problem: the first 16 cores
> > have bit 10 set and the second 16 don't. All 3 have dual Xeon Gold 6142
> > CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12. I have
> > other examples of X11DPU motherboards that don't exhibit the problem, but
> > they all have both different CPUs and different BIOS revisions. So I can't
> > be sure whether the bug follows the CPU model or the BIOS version.
> I looked at the full spec update errata list for the first gen Skylake
> Xeons, but did not notice anything relevant. The EDS doc does not provide
> much useful info on the MSR 0x179 bit 10 either, except rewording the SDM
> definition.
>
> In fact I am not sure but this bit might be writeable by software. Try
> to flip the bit with cpucontrol(8). Might be it is a BIOS bug after all.
>
> If you have Intel representative contact, or Supermicro contact, try to
> engage them. I do not have any further ideas, since spec update does not
> mention the problem.
>
> >
> >
> > > > >
> > > >
> > > > Could be. Is there some MSR that reports a more specific version number?
> > > There are CPUID %eax=1 values returned in %eax, but then it requires
> > > some interpretation.
> > > # cpucontrol -i 1 /dev/cpuctl$x
> > > for $x iterating over the cpus.
> > >
> >
> > Apart from the Local APIC ID field, that returns the same value for all
> > processors.
> >
> > Your second patch doesn't cause any obvious problems on my dev system.
> I hope that you would confirm that the issue is solved by it, after some
> time.
>
Upgrading the BIOS fixed the problem, by clearing the MCG_CMCI_P bit on all
processors. I don't have strong opinions about whether we should commit
kib's patch too. Kib, what do you think?
-Alan
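
For reference, the two MCG_CAP values quoted in this thread differ only in
bit 10 (MCG_CMCI_P). The following is a minimal sketch that decodes the low
32-bit word printed by cpucontrol -m 0x179. The helper names mcg_cmci_p and
mcg_bank_count are hypothetical, made up for this illustration; the bit
positions (bank count in bits 7:0, MCG_CMCI_P at bit 10) follow the MCG_CAP
layout discussed above.

```sh
#!/bin/sh
# Decode the low 32 bits of MCG_CAP (MSR 0x179), i.e. the second
# hex word on each line printed by "cpucontrol -m 0x179 /dev/cpuctlN".
# Helper names are hypothetical, for illustration only.

# mcg_cmci_p VALUE: print 1 if bit 10 (MCG_CMCI_P) is set, else 0.
mcg_cmci_p() {
    echo $(( ($1 >> 10) & 1 ))
}

# mcg_bank_count VALUE: print the number of MC banks (bits 7:0).
mcg_bank_count() {
    echo $(( $1 & 0xff ))
}

# The two values reported on the affected server.
for v in 0x0f000c14 0x0f000814; do
    echo "$v: banks=$(mcg_bank_count "$v") CMCI_P=$(mcg_cmci_p "$v")"
done
```

Both values report 20 banks; only the first has CMCI_P set, which matches
the "first 16 set, second 16 not" observation.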