On Sun, Nov 25, 2018 at 11:35:30PM -0500, Garrett Wollman wrote:
> <<On Mon, 19 Nov 2018 07:09:44 +0200, Konstantin Belousov <kostikbel at gmail.com> said:
>
> > On Sun, Nov 18, 2018 at 08:24:38PM -0500, Garrett Wollman wrote:
> >> Has anyone seen this before? It's on a busy NFS server, but hasn't
> >> been observed on any of our other NFS servers.
> >>
> >> ------------------------------------------------------------------------
> >> Fatal trap 12: page fault while in kernel mode
>
> >> --- trap 0xc, rip = 0xffffffff809a903d, rsp = 0xfffffe17eb8d0710, rbp = 0xfffffe17eb8d0750 ---
> >> vm_page_alloc_after() at vm_page_alloc_after+0x15d/frame 0xfffffe17eb8d0750
>
> > What is the line number for vm_page_alloc_after+0x15d ?
> > Do you have NUMA enabled on 11 ?
>
> If gdb is to be believed, the trap is at line 1687:
>
>         /*
>          * At this point we had better have found a good page.
>          */
>         KASSERT(m != NULL, ("missing page"));
>         free_count = vm_phys_freecnt_adj(m, -1);
> >>>>>>  if ((m->flags & PG_ZERO) != 0)
>                 vm_page_zero_count--;
>         mtx_unlock(&vm_page_queue_free_mtx);
>         vm_page_alloc_check(m);
>
> The faulting instruction is:
>
> 0xffffffff809a903d <vm_page_alloc_after+349>: testb $0x8,0x5a(%r14)
>
> There are no options matching /numa/i in the configuration. (This is
> a non-debugging configuration so the KASSERT is inoperative, I
> assume.) I have about a dozen other servers with the same kernel and
> they're not crashing, but obviously they all have different loads and
> sets of active clients.
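
The faulting instruction does appear to match the source line gdb
reports: a byte test of 0x8 at offset 0x5a from %r14 is what one would
expect for (m->flags & PG_ZERO) if m lives in %r14. Below is a minimal
sketch of that correspondence. It assumes, from the disassembly alone,
that flags sits at offset 0x5a, and that PG_ZERO is 0x0008 as I believe
it is in sys/vm/vm_page.h. struct mock_page and both helpers are made up
for illustration; they are not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match PG_ZERO in FreeBSD's sys/vm/vm_page.h. */
#define PG_ZERO 0x0008

/*
 * Hypothetical stand-in for struct vm_page. Only the position of
 * `flags` matters; the 0x5a offset is inferred from the faulting
 * instruction, not copied from the real structure definition.
 */
struct mock_page {
	uint8_t  opaque[0x5a];	/* fields before flags, contents irrelevant */
	uint16_t flags;
};

static_assert(offsetof(struct mock_page, flags) == 0x5a,
    "flags is expected at offset 0x5a");

/* The C-level test from the quoted source line. */
static int
mock_page_is_zeroed(const struct mock_page *m)
{
	return ((m->flags & PG_ZERO) != 0);
}

/*
 * Byte-level equivalent of `testb $0x8,0x5a(%r14)`. On a little-endian
 * machine such as amd64 the low byte of the 16-bit flags field is the
 * byte at offset 0x5a, so a one-byte test with mask 0x08 is enough.
 */
static int
mock_testb(const void *r14)
{
	const uint8_t *p = r14;

	return ((p[0x5a] & 0x08) != 0);
}
```

If that reading is right, the page fault came from dereferencing m
itself, which suggests the allocator handed back a NULL or otherwise
bogus page pointer that the compiled-out KASSERT would have caught.
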
If you're using a Skylake CPU, I suspect that you can set the
hw.skz63_enable tunable to 0 as a workaround, assuming you're not using
any code that relies on Intel TSX. (I don't think there's anything in
the base system that does.) There are some details in
https://reviews.freebsd.org/D18374.
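
A sketch of applying that, assuming hw.skz63_enable is a boot-time
loader tunable (the usual meaning of "tunable" on FreeBSD), in which
case it would go in /boot/loader.conf and take effect on the next
reboot:

```
# /boot/loader.conf
# Assumption: this is read at boot only; changing it needs a reboot.
hw.skz63_enable="0"
```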