On Fri, Feb 05, 2021 at 09:01:26AM -0700, Alan Somers wrote:
> On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov <kostikbel at gmail.com>
> wrote:
>
> > On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel at gmail.com>
> > > wrote:
> > >
> > > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel at gmail.com>
> > > > > wrote:
> > > > > > Do you have INVARIANTS enabled? If not, I am curious if enabling them
> > > > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > > > assert.
> > > > > >
> > > > > > Also might be the output of the
> > > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > > >
> > > > >
> > > > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > > > servers. However, I can turn those three KASSERTs into VERIFYs and see
> > > > > what happens. Here is what your command shows on the server that panicked:
> > > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > > > 16 MSR 0x179: 0x00000000 0x0f000c14
> > > > > 16 MSR 0x179: 0x00000000 0x0f000814
> > > >
> > > > It probably explains it, but it would be more telling if you left the
> > > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
> > > >
> > >
> > > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > > don't.
> > >
> > >
> > > >
> > > > I suspect that your machine has two sockets, and processor in one socket
> > > > has CPUs reporting MCG_CMCI_P, while other processor does not. Your SMP
> > > > is not quite symmetric, perhaps processors were from different bins?
> >
>
> I found 2 other servers that exhibit the same problem: the first 16 cores
> have bit 10 set and the second 16 don't. All 3 have dual Xeon Gold 6142
> CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12. I have
> other examples of X11DPU motherboards that don't exhibit the problem, but
> they all have both different CPUs and different BIOS revisions. So I can't
> be sure whether the bug follows the CPU model or the BIOS version.
I looked at the full spec update errata list for the first-gen Skylake
Xeons, but did not notice anything relevant. The EDS doc does not provide
much useful info on MSR 0x179 bit 10 either, beyond rewording the SDM
definition.
In fact, I am not sure, but this bit might be writable by software. Try
to flip the bit with cpucontrol(8). It might be a BIOS bug after all.
If you have an Intel representative contact, or a Supermicro contact, try to
engage them. I do not have any further ideas, since the spec update does not
mention the problem.
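
For reference, the two MCG_CAP values quoted above can be decoded with plain sh arithmetic. This is only a sketch: the helper names mcg_count and mcg_cmci_p are made up for illustration, and it assumes the second word printed by cpucontrol -m is the low 32 bits of the MSR, with the bank count in bits 7:0 and MCG_CMCI_P in bit 10.

```shell
#!/bin/sh
# Hypothetical helpers (not part of cpucontrol) that decode the low
# 32 bits of MCG_CAP (MSR 0x179).

# Number of machine-check banks reported: bits 7:0.
mcg_count() {
    echo $(( $1 & 0xff ))
}

# 1 if MCG_CMCI_P (bit 10) is set, 0 otherwise.
mcg_cmci_p() {
    echo $(( ($1 >> 10) & 1 ))
}

# The two low words reported in this thread.
for v in 0x0f000c14 0x0f000814; do
    echo "$v: banks=$(mcg_count "$v") cmci_p=$(mcg_cmci_p "$v")"
done
```

Under those assumptions both values report the same bank count and differ only in bit 10.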
>
>
> > > >
> > >
> > > Could be. Is there some MSR that reports a more specific version number?
> > There are CPUID %eax=1 values returned in %eax, but then it requires
> > some interpretation.
> > # cpucontrol -i 1 /dev/cpuctl$x
> > for $x iterating over the cpus.
> >
>
> Apart from the Local APIC ID field, that returns the same value for all
> processors.
>
> Your second patch doesn't cause any obvious problems on my dev system.
I hope you can confirm, after some time, that it solves the issue.
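
On the "some interpretation" needed for the CPUID leaf 1 %eax value mentioned earlier: a rough sh sketch of the usual family/model/stepping decoding follows. The function name cpuid_sig is hypothetical, and the input 0x00050654 is only an illustrative value, not one reported in this thread.

```shell
#!/bin/sh
# Hypothetical helper: decode a CPUID leaf 1 %eax value into the
# display family, display model and stepping.
cpuid_sig() {
    eax=$(( $1 ))
    stepping=$(( eax & 0xf ))            # bits 3:0
    model=$(( (eax >> 4) & 0xf ))        # bits 7:4
    family=$(( (eax >> 8) & 0xf ))       # bits 11:8
    ext_model=$(( (eax >> 16) & 0xf ))   # bits 19:16
    ext_family=$(( (eax >> 20) & 0xff )) # bits 27:20

    # The extended model is folded in only for base family 6 or 15.
    if [ "$family" -eq 6 ] || [ "$family" -eq 15 ]; then
        model=$(( (ext_model << 4) + model ))
    fi
    # The extended family is added only for base family 15.
    if [ "$family" -eq 15 ]; then
        family=$(( family + ext_family ))
    fi

    printf 'family=0x%x model=0x%x stepping=0x%x\n' \
        "$family" "$model" "$stepping"
}

# Illustrative input only.
cpuid_sig 0x00050654
```

If every CPU returns the same %eax, as reported above, then family, model and stepping match across both sockets, and this value cannot distinguish the two packages.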