On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov <kostikbel at gmail.com>
wrote:
> On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel at gmail.com>
> > wrote:
> >
> > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel at gmail.com>
> > > > wrote:
> > > > > Do you have INVARIANTS enabled? If not, I am curious if enabling them
> > > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > > assert.
> > > > >
> > > > > Also might be the output of the
> > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > >
> > > >
> > > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > > servers. However, I can turn those three KASSERTs into VERIFYs and see
> > > > what happens. Here is what your command shows on the server that panicked:
> > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > > 16 MSR 0x179: 0x00000000 0x0f000c14
> > > > 16 MSR 0x179: 0x00000000 0x0f000814
> > >
> > > It probably explains it, but it would be more telling if you left the
> > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
> > >
> >
> > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > don't.
> >
> >
> > >
> > > I suspect that your machine has two sockets, and processor in one socket
> > > has CPUs reporting MCG_CMCI_P, while other processor does not. Your SMP
> > > is not quite symmetric, perhaps processors were from different bins?
>
I found 2 other servers that exhibit the same problem: the first 16 cores
have bit 10 set and the second 16 don't. All 3 have dual Xeon Gold 6142
CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12. I have
other examples of X11DPU motherboards that don't exhibit the problem, but
they all have both different CPUs and different BIOS revisions. So I can't
be sure whether the bug follows the CPU model or the BIOS version.
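
For anyone who wants to check their own machines without collapsing the
output through uniq, here is a minimal sketch that annotates each line of
cpucontrol output with its position and the MCG_CMCI_P flag. The function
name annotate_mcg_cap is hypothetical, and the sketch assumes that the
second word cpucontrol prints is the low 32 bits of the MSR (which matches
the values above), that bits 7:0 of MCG_CAP hold the bank count, and that
lines arrive in CPU order.

```shell
#!/bin/sh
# Hypothetical helper: read "MSR 0x179: <high> <low>" lines on stdin and
# print one annotated line per input line.
# Assumption: input order equals CPU order, so the line index is the CPU id.
annotate_mcg_cap() {
    cpu=0
    while read -r _msr _addr _high low; do
        val=$(( ${low} ))                 # hex string -> integer
        banks=$(( val & 0xff ))           # bits 7:0: number of MC banks
        cmci_p=$(( (val >> 10) & 1 ))     # bit 10: MCG_CMCI_P
        echo "cpu$cpu banks=$banks cmci_p=$cmci_p"
        cpu=$(( cpu + 1 ))
    done
}

# Demo with the two distinct values seen in this thread. The cpu0/cpu1
# labels here are only line positions in this two-line sample.
printf 'MSR 0x179: 0x00000000 0x0f000c14\nMSR 0x179: 0x00000000 0x0f000814\n' |
    annotate_mcg_cap
# prints:
#   cpu0 banks=20 cmci_p=1
#   cpu1 banks=20 cmci_p=0
```

On a real system the for loop quoted above would be piped into
annotate_mcg_cap instead of the printf. Note that both values report the
same bank count; only bit 10 differs.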
> > >
> >
> > Could be. Is there some MSR that reports a more specific version number?
> There are CPUID %eax=1 values returned in %eax, but then it requires
> some interpretation.
> # cpucontrol -i 1 /dev/cpuctl$x
> for $x iterating over the cpus.
>
Apart from the Local APIC ID field, that returns the same value for all
processors.
Your second patch doesn't cause any obvious problems on my dev system.
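
The "interpretation" of the CPUID leaf 1 value mentioned above can be
sketched as follows. This is only an illustration: the function names
decode_cpuid1_eax and cpuid1_apic_id are hypothetical, the field layout is
the usual x86 one (stepping in bits 3:0, model in 7:4, family in 11:8,
extended model in 19:16, extended family in 27:20 of %eax, initial Local
APIC ID in bits 31:24 of %ebx), and the sample inputs are made up for the
demo, not taken from the servers in this thread.

```shell
#!/bin/sh
# Hypothetical helper: decode family/model/stepping from CPUID.1:EAX.
decode_cpuid1_eax() {
    eax=$(( $1 ))
    stepping=$(( eax & 0xf ))
    base_model=$(( (eax >> 4) & 0xf ))
    base_family=$(( (eax >> 8) & 0xf ))
    ext_model=$(( (eax >> 16) & 0xf ))
    ext_family=$(( (eax >> 20) & 0xff ))
    family=$base_family
    model=$base_model
    # The extended family is added only when the base family is 0xf.
    if [ "$base_family" -eq 15 ]; then
        family=$(( base_family + ext_family ))
    fi
    # The extended model is prepended when the base family is 6 or 0xf.
    if [ "$base_family" -eq 6 ] || [ "$base_family" -eq 15 ]; then
        model=$(( (ext_model << 4) + base_model ))
    fi
    printf 'family=0x%x model=0x%x stepping=0x%x\n' \
        "$family" "$model" "$stepping"
}

# Hypothetical helper: extract the initial Local APIC ID from CPUID.1:EBX.
cpuid1_apic_id() {
    ebx=$(( $1 ))
    echo $(( (ebx >> 24) & 0xff ))
}

# Sample inputs (illustrative only).
decode_cpuid1_eax 0x00050654   # prints: family=0x6 model=0x55 stepping=0x4
cpuid1_apic_id 0x10400800      # prints: 16
```

If the stepping field matches on both sockets, as the identical %eax values
here suggest, then CPUID leaf 1 alone cannot distinguish the two packages.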