On Sun, Feb 07, 2021 at 02:33:11PM -0700, Alan Somers wrote:
> On Fri, Feb 5, 2021 at 10:21 AM Konstantin Belousov <kostikbel at gmail.com>
> wrote:
>
> > On Fri, Feb 05, 2021 at 09:01:26AM -0700, Alan Somers wrote:
> > > On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov <kostikbel at gmail.com>
> > > wrote:
> > >
> > > > On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > > > > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel at gmail.com>
> > > > > wrote:
> > > > >
> > > > > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel at gmail.com>
> > > > > > > wrote:
> > > > > > > > Do you have INVARIANTS enabled? If not, I am curious if
> > > > > > > > enabling them would convert that rare page fault into rare
> > > > > > > > "CPU %d has more MC banks" assert.
> > > > > > > >
> > > > > > > > Also, the output of the command
> > > > > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > > > > might show the issue (0x179 is the MCG_CAP MSR).
> > > > > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > > > > >
> > > > > > >
> > > > > > > I don't have INVARIANTS enabled, and I can't enable it on the
> > > > > > > production servers. However, I can turn those three KASSERTs
> > > > > > > into VERIFYs and see what happens. Here is what your command
> > > > > > > shows on the server that panicked:
> > > > > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > > > > > 16 MSR 0x179: 0x00000000 0x0f000c14
> > > > > > > 16 MSR 0x179: 0x00000000 0x0f000814
> > > > > >
> > > > > > It probably explains it, but it would be more telling if you left
> > > > > > the output as is, so that we can see which CPUs have MCG_CMCI_P
> > > > > > (10) bit set.
> > > > > >
> > > > >
> > > > > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > > > > don't.
> > > > >
> > > > >
> > > > > >
> > > > > > I suspect that your machine has two sockets, and the processor in
> > > > > > one socket has CPUs reporting MCG_CMCI_P, while the other
> > > > > > processor does not. Your SMP is not quite symmetric; perhaps the
> > > > > > processors were from different bins?
> > > >
> > >
> > > I found 2 other servers that exhibit the same problem: the first 16 cores
> > > have bit 10 set and the second 16 don't. All 3 have dual Xeon Gold 6142
> > > CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12. I have
> > > other examples of X11DPU motherboards that don't exhibit the problem, but
> > > they all have both different CPUs and different BIOS revisions. So I can't
> > > be sure whether the bug follows the CPU model or the BIOS version.
> > I looked at the full spec update errata list for the first gen Skylake
> > Xeons, but did not notice anything relevant. The EDS doc does not provide
> > much useful info on the MSR 0x179 bit 10 either, except rewording the SDM
> > definition.
> >
> > In fact, I am not sure, but this bit might be writeable by software. Try
> > to flip the bit with cpucontrol(8). It might be a BIOS bug after all.
> >
> > If you have an Intel representative contact, or a Supermicro contact, try
> > to engage them. I do not have any further ideas, since the spec update
> > does not mention the problem.
> >
> > >
> > >
> > > > > >
> > > > >
> > > > > Could be. Is there some MSR that reports a more specific version
> > > > > number?
> > > > There are CPUID %eax=1 values returned in %eax, but then it requires
> > > > some interpretation.
> > > > # cpucontrol -i 1 /dev/cpuctl$x
> > > > for $x iterating over the cpus.
> > > >
> > >
> > > Apart from the Local APIC ID field, that returns the same value for all
> > > processors.
> > >
> > > Your second patch doesn't cause any obvious problems on my dev system.
> > I hope that you would confirm that the issue is solved by it, after some
> > time.
> >
>
> Upgrading the BIOS fixed the problem, by clearing the MCG_CMCI_P bit on all
> processors. I don't have strong opinions about whether we should commit
> kib's patch too. Kib, what do you think?
The patch causes some memory over-use.
If this issue is not too widely experienced, I prefer to not commit the patch.
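
For reference, here is a minimal sketch of how the two MCG_CAP values
quoted above decode. It is not part of the patch. It assumes the second
word printed by cpucontrol -m is the low 32 bits of the MSR, and
decode_mcg_cap is only an illustrative helper name. In the MCG_CAP layout,
bits 7:0 are the bank count and bit 10 is MCG_CMCI_P.

```sh
#!/bin/sh
# Decode the low 32 bits of MCG_CAP (MSR 0x179).
# Assumption: the argument is the second hex word printed by
# "cpucontrol -m 0x179", i.e. the low half of the 64-bit MSR.
decode_mcg_cap() {
    lo=$(($1))                   # parse the hex constant
    banks=$((lo & 0xff))         # bits 7:0: number of MC banks
    cmci=$(((lo >> 10) & 1))     # bit 10: MCG_CMCI_P
    echo "banks=$banks cmci_p=$cmci"
}

decode_mcg_cap 0x0f000c14    # value seen on the first 16 CPUs
decode_mcg_cap 0x0f000814    # value seen on the second 16 CPUs
```

Both values report the same bank count and differ only in bit 10, which
matches the per-socket asymmetry discussed above.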