On Tue, Sep 20, 2016 at 11:38:54PM +0300, Slawa Olhovchenkov wrote:
> On Tue, Sep 20, 2016 at 11:19:25PM +0300, Konstantin Belousov wrote:
> > On Tue, Sep 20, 2016 at 10:20:53PM +0300, Slawa Olhovchenkov wrote:
> > > On Tue, Sep 20, 2016 at 09:52:44AM +0300, Slawa Olhovchenkov wrote:
> > > > On Mon, Sep 19, 2016 at 06:05:46PM -0700, John Baldwin wrote:
> > > > > > > If this panics, then vmspace_switch_aio() is not working for
> > > > > > > some reason.
> > > > > >
> > > > > > I tried the following DTrace script:
> > > > > > ===
> > > > > > #pragma D option dynvarsize=64m
> > > > > >
> > > > > > int req[struct vmspace *, void *];
> > > > > > self int trace;
> > > > > >
> > > > > > syscall:freebsd:aio_read:entry
> > > > > > {
> > > > > >     this->aio = *(struct aiocb *)copyin(arg0, sizeof(struct aiocb));
> > > > > >     req[curthread->td_proc->p_vmspace, this->aio.aio_buf] = curthread->td_proc->p_pid;
> > > > > > }
> > > > > >
> > > > > > fbt:kernel:aio_process_rw:entry
> > > > > > {
> > > > > >     self->job = args[0];
> > > > > >     self->trace = 1;
> > > > > > }
> > > > > >
> > > > > > fbt:kernel:aio_process_rw:return
> > > > > > /self->trace/
> > > > > > {
> > > > > >     req[self->job->userproc->p_vmspace, self->job->uaiocb.aio_buf] = 0;
> > > > > >     self->job = 0;
> > > > > >     self->trace = 0;
> > > > > > }
> > > > > >
> > > > > > fbt:kernel:vn_io_fault:entry
> > > > > > /self->trace && !req[curthread->td_proc->p_vmspace, args[1]->uio_iov[0].iov_base]/
> > > > > > {
> > > > > >     this->buf = args[1]->uio_iov[0].iov_base;
> > > > > >     printf("%Y vn_io_fault %p:%p pid %d\n", walltimestamp, curthread->td_proc->p_vmspace, this->buf, req[curthread->td_proc->p_vmspace, this->buf]);
> > > > > > }
> > > > > > ===
> > > > > >
> > > > > > And I did not get any messages near the nginx core dump.
> > > > > > What can I check next?
> > > > > > Maybe check the context/address-space switch of the kernel process?
> > > > >
> > > > > Which CPU are you using?
> > > >
> > > > CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (2000.04-MHz K8-class CPU)
> >
> > Is this Sandy Bridge?
>
> Sandy Bridge EP
>
> > Show me the first 100 lines of the verbose dmesg,
>
> In a day or two, after the end of this test run -- I need to enable
> verbose boot first.
>
> > I want to see the CPU features lines.  In particular, does your CPU
> > support the INVPCID feature?
>
> CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (2000.05-MHz K8-class CPU)
>   Origin="GenuineIntel"  Id=0x206d7  Family=0x6  Model=0x2d  Stepping=7
>   Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
>   Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
>   AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
>   AMD Features2=0x1<LAHF>
>   XSAVE Features=0x1<XSAVEOPT>
>   VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
>   TSC: P-state invariant, performance statistics
>
> I don't see this feature before the E5 v3:
>
> CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (2600.06-MHz K8-class CPU)
>   Origin="GenuineIntel"  Id=0x306e4  Family=0x6  Model=0x3e  Stepping=4
>   Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
>   Features2=0x7fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
>   AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
>   AMD Features2=0x1<LAHF>
>   Structured Extended Features=0x281<FSGSBASE,SMEP,ERMS>
>   XSAVE Features=0x1<XSAVEOPT>
>   VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
>   TSC: P-state invariant, performance statistics
>
> (I don't run 11.0 on this CPU)

Ok.

> CPU: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz (2600.05-MHz K8-class CPU)
>   Origin="GenuineIntel"  Id=0x306f2  Family=0x6  Model=0x3f  Stepping=2
>   Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
>   Features2=0x7ffefbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
>   AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
>   AMD Features2=0x21<LAHF,ABM>
>   Structured Extended Features=0x37ab<FSGSBASE,TSCADJ,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,NFPUSG>
>   XSAVE Features=0x1<XSAVEOPT>
>   VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
>   TSC: P-state invariant, performance statistics
>
> (11.0 runs w/o this issue)

Do you mean that a similarly configured nginx+aio does not demonstrate
the corruption on this machine?

> > Also you may show me the 'sysctl vm.pmap' output.
>
> # sysctl vm.pmap
> vm.pmap.pdpe.demotions: 3
> vm.pmap.pde.promotions: 172495
> vm.pmap.pde.p_failures: 2119294
> vm.pmap.pde.mappings: 1927
> vm.pmap.pde.demotions: 126192
> vm.pmap.pcid_save_cnt: 0
> vm.pmap.invpcid_works: 0
> vm.pmap.pcid_enabled: 0
> vm.pmap.pg_ps_enabled: 1
> vm.pmap.pat_works: 1
>
> This is after vm.pmap.pcid_enabled=0 in loader.conf.
>
> > > > > Perhaps try disabling PCID support (I think vm.pmap.pcid_enabled=0
> > > > > from the loader prompt or loader.conf)?  (Wondering if
> > > > > pmap_activate() is somehow not switching.)
> > >
> > > I need some more time to test (a day or two), but for now this looks
> > > like a workaround/solution: 12h of runtime including the peak hour
> > > w/o an nginx crash.  (vm.pmap.pcid_enabled=0 in loader.conf)
> >
> > Please try this variation of the previous patch.
>
> and remove vm.pmap.pcid_enabled=0?

Definitely.

> > diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
> > index a23468e..f754652 100644
> > --- a/sys/vm/vm_map.c
> > +++ b/sys/vm/vm_map.c
> > @@ -481,6 +481,7 @@ vmspace_switch_aio(struct vmspace *newvm)
> >  	if (oldvm == newvm)
> >  		return;
> >  
> > +	spinlock_enter();
> >  	/*
> >  	 * Point to the new address space and refer to it.
> >  	 */
> > @@ -489,6 +490,7 @@ vmspace_switch_aio(struct vmspace *newvm)
> >  
> >  	/* Activate the new mapping. */
> >  	pmap_activate(curthread);
> > +	spinlock_exit();
> >  
> >  	/* Remove the daemon's reference to the old address space. */
> >  	KASSERT(oldvm->vm_refcnt > 1,
On Wed, Sep 21, 2016 at 12:15:17AM +0300, Konstantin Belousov wrote:
> On Tue, Sep 20, 2016 at 11:38:54PM +0300, Slawa Olhovchenkov wrote:
> > On Tue, Sep 20, 2016 at 11:19:25PM +0300, Konstantin Belousov wrote:
> > > On Tue, Sep 20, 2016 at 10:20:53PM +0300, Slawa Olhovchenkov wrote:
> > > > On Tue, Sep 20, 2016 at 09:52:44AM +0300, Slawa Olhovchenkov wrote:
> > > > > On Mon, Sep 19, 2016 at 06:05:46PM -0700, John Baldwin wrote:
[...]
> > CPU: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz (2600.05-MHz K8-class CPU)
> >   Origin="GenuineIntel"  Id=0x306f2  Family=0x6  Model=0x3f  Stepping=2
> >   Structured Extended Features=0x37ab<FSGSBASE,TSCADJ,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,NFPUSG>
[...]
> > (11.0 runs w/o this issue)
>
> Do you mean that a similarly configured nginx+aio does not demonstrate
> the corruption on this machine?

Yes.  But with a different storage configuration and a different load
pattern.

Also, 11.0 runs w/o this issue on:

CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (2200.04-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x406f1  Family=0x6  Model=0x4f  Stepping=1
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffefbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x21cbfbb<FSGSBASE,TSCADJ,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,NFPUSG,PQE,RDSEED,ADX,SMAP,PROCTRACE>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
  TSC: P-state invariant, performance statistics

PS: all systems are dual-CPU.
On Wed, Sep 21, 2016 at 12:15:17AM +0300, Konstantin Belousov wrote:
> > > diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
> > > index a23468e..f754652 100644
> > > --- a/sys/vm/vm_map.c
> > > +++ b/sys/vm/vm_map.c
> > > @@ -481,6 +481,7 @@ vmspace_switch_aio(struct vmspace *newvm)
> > >  	if (oldvm == newvm)
> > >  		return;
> > >  
> > > +	spinlock_enter();
> > >  	/*
> > >  	 * Point to the new address space and refer to it.
> > >  	 */
> > > @@ -489,6 +490,7 @@ vmspace_switch_aio(struct vmspace *newvm)
> > >  
> > >  	/* Activate the new mapping. */
> > >  	pmap_activate(curthread);
> > > +	spinlock_exit();
> > >  
> > >  	/* Remove the daemon's reference to the old address space. */
> > >  	KASSERT(oldvm->vm_refcnt > 1,

Did you test the patch?

Below is, I believe, the committable fix, of course supposing that the
patch above worked.  If you want to retest it on stable/11, ignore the
efirt.c chunks.

diff --git a/sys/amd64/amd64/efirt.c b/sys/amd64/amd64/efirt.c
index f1d67f7..c883af8 100644
--- a/sys/amd64/amd64/efirt.c
+++ b/sys/amd64/amd64/efirt.c
@@ -53,6 +53,7 @@ __FBSDID("$FreeBSD$");
 #include <machine/vmparam.h>
 #include <vm/vm.h>
 #include <vm/pmap.h>
+#include <vm/vm_map.h>
 #include <vm/vm_object.h>
 #include <vm/vm_page.h>
 #include <vm/vm_pager.h>
@@ -301,6 +302,17 @@ efi_enter(void)
 		PMAP_UNLOCK(curpmap);
 		return (error);
 	}
+
+	/*
+	 * The IPI TLB shootdown handler invltlb_pcid_handler()
+	 * reloads %cr3 from curpmap->pm_cr3, which would disable the
+	 * runtime segments mappings.  Block the handler's action by
+	 * setting curpmap to an impossible value.  See also the
+	 * comment in pmap.c:pmap_activate_sw().
+	 */
+	if (pmap_pcid_enabled && !invpcid_works)
+		PCPU_SET(curpmap, NULL);
+
 	load_cr3(VM_PAGE_TO_PHYS(efi_pml4_page) | (pmap_pcid_enabled ?
 	    curpmap->pm_pcids[PCPU_GET(cpuid)].pm_pcid : 0));
 	/*
@@ -317,7 +329,9 @@ efi_leave(void)
 {
 	pmap_t curpmap;
 
-	curpmap = PCPU_GET(curpmap);
+	curpmap = &curproc->p_vmspace->vm_pmap;
+	if (pmap_pcid_enabled && !invpcid_works)
+		PCPU_SET(curpmap, curpmap);
 	load_cr3(curpmap->pm_cr3 | (pmap_pcid_enabled ?
 	    curpmap->pm_pcids[PCPU_GET(cpuid)].pm_pcid : 0));
 	if (!pmap_pcid_enabled)
diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 63042e4..59e1b67 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -6842,6 +6842,7 @@ pmap_activate_sw(struct thread *td)
 {
 	pmap_t oldpmap, pmap;
 	uint64_t cached, cr3;
+	register_t rflags;
 	u_int cpuid;
 
 	oldpmap = PCPU_GET(curpmap);
@@ -6865,16 +6866,43 @@ pmap_activate_sw(struct thread *td)
 		    pmap == kernel_pmap,
 		    ("non-kernel pmap thread %p pmap %p cpu %d pcid %#x",
 		    td, pmap, cpuid, pmap->pm_pcids[cpuid].pm_pcid));
+
+		/*
+		 * If the INVPCID instruction is not available,
+		 * invltlb_pcid_handler() is used to handle the
+		 * invalidate_all IPI, and it checks for curpmap ==
+		 * smp_tlb_pmap.  The operation sequence below has a
+		 * window where %CR3 is loaded with the new pmap's
+		 * PML4 address, but the curpmap value is not yet
+		 * updated.  This causes the invltlb IPI handler,
+		 * called between the updates, to execute as a NOP,
+		 * which leaves stale TLB entries.
+		 *
+		 * Note that the most typical use of
+		 * pmap_activate_sw(), from the context switch, is
+		 * immune to this race, because interrupts are
+		 * disabled (while the thread lock is owned), and the
+		 * IPI happens after curpmap is updated.  Protect
+		 * other callers in a similar way, by disabling
+		 * interrupts around the %cr3 register reload and the
+		 * curpmap assignment.
+		 */
+		if (!invpcid_works)
+			rflags = intr_disable();
+
 		if (!cached || (cr3 & ~CR3_PCID_MASK) != pmap->pm_cr3) {
 			load_cr3(pmap->pm_cr3 | pmap->pm_pcids[cpuid].pm_pcid |
 			    cached);
 			if (cached)
 				PCPU_INC(pm_save_cnt);
 		}
+		PCPU_SET(curpmap, pmap);
+		if (!invpcid_works)
+			intr_restore(rflags);
 	} else if (cr3 != pmap->pm_cr3) {
 		load_cr3(pmap->pm_cr3);
+		PCPU_SET(curpmap, pmap);
 	}
-	PCPU_SET(curpmap, pmap);
 #ifdef SMP
 	CPU_CLR_ATOMIC(cpuid, &oldpmap->pm_active);
 #else