[ Cover-letter is identical to v2, including benchmark results,
  excluding the change log. ]

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on the
previous behavior of on_each_cpu(), which invoked functions first
remotely and only then locally [Peter says he remembers someone might do
so, but without further information it is hard to know how to address
it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:

 sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
  --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
  ---   ---------------------   ----------------------  ------
  1     1267765 (14146)         1299253 (5715)          +2.4%
  2     1734644 (11936)         1799225 (19577)         +3.7%
  4     2821268 (41184)         2919132 (40149)         +3.4%
  8     4171652 (31243)         4376925 (65416)         +4.9%
  16    5590729 (24160)         5829866 (8127)          +4.2%
  24    6250212 (24481)         6522303 (28044)         +4.3%
  32    3994314 (26606)         4077543 (10685)         +2.0%
  48    4345177 (28091)         4417821 (41337)         +1.6%

(Note that on configurations with up to 24 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
  ---   ---------------------   ----------------------  ------
  1     1598896 (5174)          1607903 (4091)          +0.5%
  2     2109472 (17827)         2224726 (4372)          +5.4%
  4     3448587 (11952)         3668551 (30219)         +6.3%
  8     5425778 (29641)         5606266 (33519)         +3.3%
  16    6931232 (34677)         7054052 (27873)         +1.7%
  24    7612473 (23482)         7783138 (13871)         +2.2%
  32    4296274 (18029)         4283279 (32323)         -0.3%
  48    4770029 (35541)         4764760 (13575)         -0.1%

Presumably, PTI requires two invalidations of each mapping, which allows
concurrency to yield higher benefits when PTI is on. At the same time,
when mitigations are on, other overheads reduce the potential speedup.

I tried to reduce the size of the code of the main patch, which required
restructuring of the series.

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski <luto at kernel.org>
Cc: Borislav Petkov <bp at alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky at oracle.com>
Cc: Dave Hansen <dave.hansen at linux.intel.com>
Cc: Haiyang Zhang <haiyangz at microsoft.com>
Cc: Ingo Molnar <mingo at redhat.com>
Cc: Josh Poimboeuf <jpoimboe at redhat.com>
Cc: Juergen Gross <jgross at suse.com>
Cc: "K. Y. Srinivasan" <kys at microsoft.com>
Cc: Paolo Bonzini <pbonzini at redhat.com>
Cc: Peter Zijlstra <peterz at infradead.org>
Cc: Rik van Riel <riel at surriel.com>
Cc: Sasha Levin <sashal at kernel.org>
Cc: Stephen Hemminger <sthemmin at microsoft.com>
Cc: Thomas Gleixner <tglx at linutronix.de>
Cc: kvm at vger.kernel.org
Cc: linux-hyperv at vger.kernel.org
Cc: linux-kernel at vger.kernel.org
Cc: virtualization at lists.linux-foundation.org
Cc: x86 at kernel.org
Cc: xen-devel at lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c                 |  10 +-
 arch/x86/include/asm/paravirt.h       |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h       |  47 ++++-----
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c                 |  11 ++-
 arch/x86/kernel/paravirt.c            |   2 +-
 arch/x86/mm/init.c                    |   2 +-
 arch/x86/mm/tlb.c                     | 133 ++++++++++++++++----------
 arch/x86/xen/mmu_pv.c                 |  11 +--
 include/linux/cpumask.h               |   6 +-
 include/linux/smp.h                   |  27 ++++--
 include/trace/events/xen.h            |   2 +-
 kernel/smp.c                          | 133 ++++++++++++--------------
 14 files changed, 218 insertions(+), 178 deletions(-)

-- 
2.20.1
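The ordering change described in the cover letter can be illustrated with a toy cost model. This is only a sketch: the struct, its field names, and the idea of summarizing a shootdown with three cycle counts are assumptions made for illustration and are not taken from the series. It ignores interrupt latency, cache effects and contention.

```c
/*
 * Toy model only. All inputs are hypothetical cycle counts supplied by the
 * caller; nothing here is a measurement.
 */
struct shootdown_cost {
	unsigned long send_ipi;	/* cost of requesting the remote flushes */
	unsigned long local;	/* cost of the flush on the initiating CPU */
	unsigned long remote;	/* time until the slowest remote CPU is done */
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Serialized ordering: flush locally, then request remote flushes and wait. */
static unsigned long serialized_cycles(const struct shootdown_cost *c)
{
	return c->local + c->send_ipi + c->remote;
}

/* Concurrent ordering: request remote flushes, flush locally, then wait. */
static unsigned long concurrent_cycles(const struct shootdown_cost *c)
{
	return c->send_ipi + max_ul(c->local, c->remote);
}
```

Under this model the saving is the smaller of the local and remote costs, which is consistent with the cover letter's guess that the benefit grows when each flush is more expensive (for example with PTI).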
Nadav Amit
2019-Jul-19 00:58 UTC
[PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.

Cc: "K. Y. Srinivasan" <kys at microsoft.com>
Cc: Haiyang Zhang <haiyangz at microsoft.com>
Cc: Stephen Hemminger <sthemmin at microsoft.com>
Cc: Sasha Levin <sashal at kernel.org>
Cc: Thomas Gleixner <tglx at linutronix.de>
Cc: Ingo Molnar <mingo at redhat.com>
Cc: Borislav Petkov <bp at alien8.de>
Cc: x86 at kernel.org
Cc: Juergen Gross <jgross at suse.com>
Cc: Paolo Bonzini <pbonzini at redhat.com>
Cc: Dave Hansen <dave.hansen at linux.intel.com>
Cc: Andy Lutomirski <luto at kernel.org>
Cc: Peter Zijlstra <peterz at infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky at oracle.com>
Cc: linux-hyperv at vger.kernel.org
Cc: linux-kernel at vger.kernel.org
Cc: virtualization at lists.linux-foundation.org
Cc: kvm at vger.kernel.org
Cc: xen-devel at lists.xenproject.org
Signed-off-by: Nadav Amit <namit at vmware.com>
---
 arch/x86/hyperv/mmu.c                 | 10 +++---
 arch/x86/include/asm/paravirt.h       |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h       |  8 ++---
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c                 | 11 +++++--
 arch/x86/kernel/paravirt.c            |  2 +-
 arch/x86/mm/tlb.c                     | 47 ++++++++++++++++++---------
 arch/x86/xen/mmu_pv.c                 | 11 +++----
 include/trace/events/xen.h            |  2 +-
 10 files changed, 62 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
 	return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-				    const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+				   const struct flush_tlb_info *info)
 {
 	int cpu, vcpu, gva_n, max_gvas;
 	struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	u64 status = U64_MAX;
 	unsigned long flags;
 
-	trace_hyperv_mmu_flush_tlb_others(cpus, info);
+	trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
 	if (!hv_hypercall_pg)
 		goto do_native;
@@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
 	if (!(status & HV_HYPERCALL_RESULT_MASK))
 		return;
 do_native:
-	native_flush_tlb_others(cpus, info);
+	native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void)
 		return;
 
 	pr_info("Using hypercall for remote TLB flush\n");
-	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+	pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
 	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index dce26f1d13e1..8c6c2394393b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
 	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-				    const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+				   const struct flush_tlb_info *info)
 {
-	PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 639b2df445ee..c82969f38845 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,8 +211,8 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_one_user)(unsigned long addr);
-	void (*flush_tlb_others)(const struct cpumask *cpus,
-				 const struct flush_tlb_info *info);
+	void (*flush_tlb_multi)(const struct cpumask *cpus,
+				const struct flush_tlb_info *info);
 
 	void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..610e47dc66ef 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -517,7 +517,7 @@ static inline void __flush_tlb_one_kernel(unsigned long addr)
  *  - flush_tlb_page(vma, vmaddr) flushes one page
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
- *  - flush_tlb_others(cpumask, info) flushes TLBs on other cpus
+ *  - flush_tlb_multi(cpumask, info) flushes TLBs on multiple cpus
  *
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
@@ -569,7 +569,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
@@ -593,8 +593,8 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
-#define flush_tlb_others(mask, info)	\
-	native_flush_tlb_others(mask, info)
+#define flush_tlb_multi(mask, info)	\
+	native_flush_tlb_multi(mask, info)
 
 #define paravirt_tlb_remove_table(tlb, page) \
 	tlb_remove_page(tlb, (void *)(page))
diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
index ace464f09681..85ca8560c7f9 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -8,7 +8,7 @@
 
 #if IS_ENABLED(CONFIG_HYPERV)
 
-TRACE_EVENT(hyperv_mmu_flush_tlb_others,
+TRACE_EVENT(hyperv_mmu_flush_tlb_multi,
 	    TP_PROTO(const struct cpumask *cpus,
 		     const struct flush_tlb_info *info),
 	    TP_ARGS(cpus, info),
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b7f34fe2171e..de40657d9025 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -595,7 +595,7 @@ static void __init kvm_apf_trap_init(void)
 
 static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
 
-static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
 			const struct flush_tlb_info *info)
 {
 	u8 state;
@@ -609,6 +609,11 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 	 * queue flush_on_enter for pre-empted vCPUs
 	 */
 	for_each_cpu(cpu, flushmask) {
+		/*
+		 * The local vCPU is never preempted, so we do not explicitly
+		 * skip check for local vCPU - it will never be cleared from
+		 * flushmask.
+		 */
 		src = &per_cpu(steal_time, cpu);
 		state = READ_ONCE(src->preempted);
 		if ((state & KVM_VCPU_PREEMPTED)) {
@@ -618,7 +623,7 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 		}
 	}
 
-	native_flush_tlb_others(flushmask, info);
+	native_flush_tlb_multi(flushmask, info);
 }
 
 static void __init kvm_guest_init(void)
@@ -643,7 +648,7 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 	    !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
+		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 	}
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 0aa6256eedd8..6af40844a730 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -363,7 +363,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_user	= native_flush_tlb,
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
-	.mmu.flush_tlb_others	= native_flush_tlb_others,
+	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
 	.mmu.tlb_remove_table	=
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index abbf55fa8b81..63c00908bdd9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -551,7 +551,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
 		 *
-		 * This should be rare, with native_flush_tlb_others skipping
+		 * This should be rare, with native_flush_tlb_multi() skipping
 		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
@@ -665,9 +665,14 @@ static bool tlb_is_not_lazy(int cpu)
 
 static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
-			     const struct flush_tlb_info *info)
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			    const struct flush_tlb_info *info)
 {
+	/*
+	 * Do accounting and tracing. Note that there are (and have always been)
+	 * cases in which a remote TLB flush will be traced, but eventually
+	 * would not happen.
+	 */
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -687,10 +692,12 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * means that the percpu tlb_gen variables won't be updated
 		 * and we'll do pointless flushes on future context switches.
 		 *
-		 * Rather than hooking native_flush_tlb_others() here, I think
+		 * Rather than hooking native_flush_tlb_multi() here, I think
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
+		flush_tlb_func_local((void *)info);
+
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
 			smp_call_function_many(cpumask, flush_tlb_func_remote,
@@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	 * doing a speculative memory access.
 	 */
 	if (info->freed_tables) {
-		smp_call_function_many(cpumask, flush_tlb_func_remote,
-			       (void *)info, 1);
+		__smp_call_function_many(cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local,
+					 (void *)info, 1);
 	} else {
 		/*
 		 * Although we could have used on_each_cpu_cond_mask(),
@@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 			if (tlb_is_not_lazy(cpu))
 				__cpumask_set_cpu(cpu, cond_cpumask);
 		}
-		smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local,
 					 (void *)info, 1);
 	}
 }
@@ -818,16 +827,20 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
 				  new_tlb_gen);
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+	/*
+	 * flush_tlb_multi() is not optimized for the common case in which only
+	 * a local TLB flush is needed. Optimize this use-case by calling
+	 * flush_tlb_func_local() directly in this case.
+	 */
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+		flush_tlb_multi(mm_cpumask(mm), info);
+	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
 		flush_tlb_func_local(info);
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), info);
-
 	put_flush_tlb_info();
 	put_cpu();
 }
@@ -890,16 +903,20 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	int cpu = get_cpu();
 
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+	/*
+	 * flush_tlb_multi() is not optimized for the common case in which only
+	 * a local TLB flush is needed. Optimize this use-case by calling
+	 * flush_tlb_func_local() directly in this case.
+	 */
+	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+		flush_tlb_multi(&batch->cpumask, &full_flush_tlb_info);
+	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
 		flush_tlb_func_local((void *)&full_flush_tlb_info);
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
-
 	cpumask_clear(&batch->cpumask);
 
 	put_cpu();
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 26e8b326966d..48f7c7eb4dbc 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1345,8 +1345,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
 	preempt_enable();
 }
 
-static void xen_flush_tlb_others(const struct cpumask *cpus,
-				 const struct flush_tlb_info *info)
+static void xen_flush_tlb_multi(const struct cpumask *cpus,
+				const struct flush_tlb_info *info)
 {
 	struct {
 		struct mmuext_op op;
@@ -1356,7 +1356,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	const size_t mc_entry_size = sizeof(args->op) +
 		sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
 
-	trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
+	trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
 
 	if (cpumask_empty(cpus))
 		return;		/* nothing to do */
@@ -1365,9 +1365,8 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
 	args = mcs.args;
 	args->op.arg2.vcpumask = to_cpumask(args->mask);
 
-	/* Remove us, and any offline CPUS. */
+	/* Remove any offline CPUs */
 	cpumask_and(to_cpumask(args->mask), cpus, cpu_online_mask);
-	cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask));
 
 	args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
 	if (info->end != TLB_FLUSH_ALL &&
@@ -2396,7 +2395,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 	.flush_tlb_user = xen_flush_tlb,
 	.flush_tlb_kernel = xen_flush_tlb,
 	.flush_tlb_one_user = xen_flush_tlb_one_user,
-	.flush_tlb_others = xen_flush_tlb_others,
+	.flush_tlb_multi = xen_flush_tlb_multi,
 	.tlb_remove_table = tlb_remove_table,
 
 	.pgd_alloc = xen_pgd_alloc,
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index 9a0e8af21310..546022acf160 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -362,7 +362,7 @@ TRACE_EVENT(xen_mmu_flush_tlb_one_user,
 	    TP_printk("addr %lx", __entry->addr)
 	);
 
-TRACE_EVENT(xen_mmu_flush_tlb_others,
+TRACE_EVENT(xen_mmu_flush_tlb_multi,
 	    TP_PROTO(const struct cpumask *cpus, struct mm_struct *mm,
 		     unsigned long addr, unsigned long end),
 	    TP_ARGS(cpus, mm, addr, end),
-- 
2.20.1
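The local-only fast path that this patch adds to arch_tlbbatch_flush() can be modeled in user space as below. This is a minimal sketch, not kernel code: mask_t, NR_CPUS_MODEL, mask_any_but(), choose_flush_path() and the flush_path enum are hypothetical names, and a single 64-bit word stands in for a cpumask. flush_tlb_mm_range() follows the same shape but tests the loaded mm rather than the mask bit for the local case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU. */
typedef uint64_t mask_t;
#define NR_CPUS_MODEL 64u

enum flush_path { FLUSH_NONE, FLUSH_LOCAL_ONLY, FLUSH_MULTI };

/*
 * Model of cpumask_any_but(): return some CPU set in the mask other than
 * 'cpu', or NR_CPUS_MODEL if there is none.
 */
static unsigned int mask_any_but(mask_t mask, unsigned int cpu)
{
	mask_t others = mask & ~((mask_t)1 << cpu);
	unsigned int i;

	for (i = 0; i < NR_CPUS_MODEL; i++)
		if (others & ((mask_t)1 << i))
			return i;
	return NR_CPUS_MODEL;
}

/* Decide which path the flush would take for the given mask and local CPU. */
static enum flush_path choose_flush_path(mask_t mask, unsigned int cpu)
{
	assert(cpu < NR_CPUS_MODEL);

	/* Any other CPU in the mask: go through the multi-CPU path. */
	if (mask_any_but(mask, cpu) < NR_CPUS_MODEL)
		return FLUSH_MULTI;

	/* Only the local CPU is in the mask: flush directly, no SMP call. */
	if (mask & ((mask_t)1 << cpu))
		return FLUSH_LOCAL_ONLY;

	return FLUSH_NONE;
}
```

Note that the multi-CPU path is chosen even when the local CPU is absent from the mask, matching the patch, which leaves it to the callee to handle that case.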
Thanks for doing this, it's something I've been hoping someone would do
for a long time.

While I kinda wish we had performance data for each individual patch (at
least the behavior-changing ones), this does look like a good
improvement. That might, for instance, tell us a bit about how
separating out "is_lazy" compares to the "check before setting"
optimization. But, they're both sane enough on their own that I'm not
too worried.

I had some nits that I hope get covered in later revisions, if sent.
But, overall looks fine. For the series:

Reviewed-by: Dave Hansen <dave.hansen at linux.intel.com>
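For readers unfamiliar with the "check before setting" idea mentioned above, here is a minimal user-space sketch. It assumes the optimization amounts to reading a flag and storing to it only when the value would change, so that an already-correct flag does not dirty its cache line. The struct, the stores counter and set_lazy_checked() are hypothetical and exist only to make the effect observable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a per-CPU flag; 'stores' is for illustration only. */
struct lazy_state {
	bool is_lazy;
	unsigned long stores;	/* number of writes actually issued */
};

/* Store only when the value changes; otherwise the flag is merely read. */
static void set_lazy_checked(struct lazy_state *s, bool lazy)
{
	assert(s);

	if (s->is_lazy != lazy) {
		s->is_lazy = lazy;
		s->stores++;
	}
}
```

Repeated calls with the same value leave the counter unchanged, which is the property that keeps other CPUs' cached copies of the line valid.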
Peter Zijlstra
2019-Jul-22 19:14 UTC
[PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
On Thu, Jul 18, 2019 at 05:58:32PM -0700, Nadav Amit wrote:
> @@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  	 * doing a speculative memory access.
>  	 */
>  	if (info->freed_tables) {
> -		smp_call_function_many(cpumask, flush_tlb_func_remote,
> -			       (void *)info, 1);
> +		__smp_call_function_many(cpumask, flush_tlb_func_remote,
> +					 flush_tlb_func_local,
> +					 (void *)info, 1);
>  	} else {
>  		/*
>  		 * Although we could have used on_each_cpu_cond_mask(),
> @@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  			if (tlb_is_not_lazy(cpu))
>  				__cpumask_set_cpu(cpu, cond_cpumask);
>  		}
> -		smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
> +		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
> +					 flush_tlb_func_local,
>  					 (void *)info, 1);
>  	}
>  }

Do we really need that _local/_remote distinction? ISTR you had a patch
that frobbed flush_tlb_info into the csd and that gave space
constraints, but I'm not seeing that here (probably a wise, get stuff
merged etc..).

struct __call_single_data {
	struct llist_node          llist;                /*     0     8 */
	smp_call_func_t            func;                 /*     8     8 */
	void *                     info;                 /*    16     8 */
	unsigned int               flags;                /*    24     4 */

	/* size: 32, cachelines: 1, members: 4 */
	/* padding: 4 */
	/* last cacheline: 32 bytes */
};

struct flush_tlb_info {
	struct mm_struct *         mm;                   /*     0     8 */
	long unsigned int          start;                /*     8     8 */
	long unsigned int          end;                  /*    16     8 */
	u64                        new_tlb_gen;          /*    24     8 */
	unsigned int               stride_shift;         /*    32     4 */
	bool                       freed_tables;         /*    36     1 */

	/* size: 40, cachelines: 1, members: 6 */
	/* padding: 3 */
	/* last cacheline: 40 bytes */
};

IIRC what you did was make void *__call_single_data::info the last
member and a union until the full cacheline size (64). Given the above
that would get us 24 bytes for csd, leaving us 40 for that
flush_tlb_info.

But then we can still do something like the below, which doesn't change
things and still gets rid of that dual function crud, simplifying
smp_call_function_many again.

Index: linux-2.6/arch/x86/include/asm/tlbflush.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/tlbflush.h
+++ linux-2.6/arch/x86/include/asm/tlbflush.h
@@ -546,8 +546,9 @@ struct flush_tlb_info {
 	unsigned long		start;
 	unsigned long		end;
 	u64			new_tlb_gen;
-	unsigned int		stride_shift;
-	bool			freed_tables;
+	unsigned int		cpu;
+	unsigned short		stride_shift;
+	unsigned char		freed_tables;
 };
 
 #define local_flush_tlb() __flush_tlb()
Index: linux-2.6/arch/x86/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/tlb.c
+++ linux-2.6/arch/x86/mm/tlb.c
@@ -659,6 +659,27 @@ static void flush_tlb_func_remote(void *
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
+static void flush_tlb_func(void *info)
+{
+	const struct flush_tlb_info *f = info;
+	enum tlb_flush_reason reason = TLB_REMOTE_SHOOTDOWN;
+	bool local = false;
+
+	if (f->cpu == smp_processor_id()) {
+		local = true;
+		reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
+	} else {
+		inc_irq_stat(irq_tlb_count);
+
+		if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
+			return;
+
+		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+	}
+
+	flush_tlb_func_common(f, local, reason);
+}
+
 static bool tlb_is_not_lazy(int cpu)
 {
 	return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
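The size arithmetic in the message above (24 bytes of csd header plus 40 bytes of flush_tlb_info filling one 64-byte cacheline) can be checked with a user-space model. This is a sketch under stated assumptions: every type with a _model suffix is a hypothetical stand-in rather than the kernel definition, the flush_tlb_info layout is the one from the proposed diff (with the added cpu field), and the numbers only hold on an LP64 target such as x86-64 Linux.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel types (LP64 assumed). */
struct llist_node_model {
	struct llist_node_model *next;
};

typedef void (*call_func_model_t)(void *info);

/* Layout from the proposed diff: cpu added, stride_shift/freed_tables shrunk. */
struct flush_tlb_info_model {
	void *mm;
	unsigned long start;
	unsigned long end;
	uint64_t new_tlb_gen;
	unsigned int cpu;
	unsigned short stride_shift;
	unsigned char freed_tables;
};

/* csd with 'info' moved last and widened into a union holding the payload. */
struct csd_model {
	struct llist_node_model llist;	/* 8 bytes */
	call_func_model_t func;		/* 8 bytes */
	unsigned int flags;		/* 4 bytes, then 4 bytes of padding */
	union {
		void *info;
		struct flush_tlb_info_model inline_info;
	} u;
};

/* Offset at which the inline payload starts. */
static size_t csd_payload_offset(void)
{
	return offsetof(struct csd_model, u);
}

/* Total size of the modeled csd, payload included. */
static size_t csd_total_size(void)
{
	return sizeof(struct csd_model);
}
```

On LP64 the payload starts at offset 24 and the whole structure is 64 bytes, so adding the cpu field does not push the structure past a single cacheline in this model.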
Juergen Gross
2019-Jul-26 07:28 UTC
[PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
On 19.07.19 02:58, Nadav Amit wrote:
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. Introduce
> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
> and hyper-v are only compile-tested).
> 
> While the updated smp infrastructure is capable of running a function on
> a single local core, it is not optimized for this case. The multiple
> function calls and the indirect branch introduce some overhead, and
> might make local TLB flushes slower than they were before the recent
> changes.
> 
> Before calling the SMP infrastructure, check if only a local TLB flush
> is needed to restore the lost performance in this common case. This
> requires checking mm_cpumask() one more time, but unless this mask is
> updated very frequently, this should not impact performance negatively.
> 
> Cc: "K. Y. Srinivasan" <kys at microsoft.com>
> Cc: Haiyang Zhang <haiyangz at microsoft.com>
> Cc: Stephen Hemminger <sthemmin at microsoft.com>
> Cc: Sasha Levin <sashal at kernel.org>
> Cc: Thomas Gleixner <tglx at linutronix.de>
> Cc: Ingo Molnar <mingo at redhat.com>
> Cc: Borislav Petkov <bp at alien8.de>
> Cc: x86 at kernel.org
> Cc: Juergen Gross <jgross at suse.com>
> Cc: Paolo Bonzini <pbonzini at redhat.com>
> Cc: Dave Hansen <dave.hansen at linux.intel.com>
> Cc: Andy Lutomirski <luto at kernel.org>
> Cc: Peter Zijlstra <peterz at infradead.org>
> Cc: Boris Ostrovsky <boris.ostrovsky at oracle.com>
> Cc: linux-hyperv at vger.kernel.org
> Cc: linux-kernel at vger.kernel.org
> Cc: virtualization at lists.linux-foundation.org
> Cc: kvm at vger.kernel.org
> Cc: xen-devel at lists.xenproject.org
> Signed-off-by: Nadav Amit <namit at vmware.com>
> ---
>  arch/x86/hyperv/mmu.c                 | 10 +++---
>  arch/x86/include/asm/paravirt.h       |  6 ++--
>  arch/x86/include/asm/paravirt_types.h |  4 +--
>  arch/x86/include/asm/tlbflush.h       |  8 ++---
>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>  arch/x86/kernel/kvm.c                 | 11 +++++--
>  arch/x86/kernel/paravirt.c            |  2 +-
>  arch/x86/mm/tlb.c                     | 47 ++++++++++++++++++---------
>  arch/x86/xen/mmu_pv.c                 | 11 +++----
>  include/trace/events/xen.h            |  2 +-
>  10 files changed, 62 insertions(+), 41 deletions(-)

Xen and paravirt parts:

Reviewed-by: Juergen Gross <jgross at suse.com>


Juergen
Michael Kelley
2019-Jul-31 00:13 UTC
[PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Nadav Amit <namit at vmware.com> Sent: Thursday, July 18, 2019 5:59 PM
> 
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. Introduce
> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
> and hyper-v are only compile-tested).
> 
> While the updated smp infrastructure is capable of running a function on
> a single local core, it is not optimized for this case. The multiple
> function calls and the indirect branch introduce some overhead, and
> might make local TLB flushes slower than they were before the recent
> changes.
> 
> Before calling the SMP infrastructure, check if only a local TLB flush
> is needed to restore the lost performance in this common case. This
> requires checking mm_cpumask() one more time, but unless this mask is
> updated very frequently, this should not impact performance negatively.
> 
> Cc: "K. Y. Srinivasan" <kys at microsoft.com>
> Cc: Haiyang Zhang <haiyangz at microsoft.com>
> Cc: Stephen Hemminger <sthemmin at microsoft.com>
> Cc: Sasha Levin <sashal at kernel.org>
> Cc: Thomas Gleixner <tglx at linutronix.de>
> Cc: Ingo Molnar <mingo at redhat.com>
> Cc: Borislav Petkov <bp at alien8.de>
> Cc: x86 at kernel.org
> Cc: Juergen Gross <jgross at suse.com>
> Cc: Paolo Bonzini <pbonzini at redhat.com>
> Cc: Dave Hansen <dave.hansen at linux.intel.com>
> Cc: Andy Lutomirski <luto at kernel.org>
> Cc: Peter Zijlstra <peterz at infradead.org>
> Cc: Boris Ostrovsky <boris.ostrovsky at oracle.com>
> Cc: linux-hyperv at vger.kernel.org
> Cc: linux-kernel at vger.kernel.org
> Cc: virtualization at lists.linux-foundation.org
> Cc: kvm at vger.kernel.org
> Cc: xen-devel at lists.xenproject.org
> Signed-off-by: Nadav Amit <namit at vmware.com>
> ---
>  arch/x86/hyperv/mmu.c                 | 10 +++---
>  arch/x86/include/asm/paravirt.h       |  6 ++--
>  arch/x86/include/asm/paravirt_types.h |  4 +--
>  arch/x86/include/asm/tlbflush.h       |  8 ++---
>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>  arch/x86/kernel/kvm.c                 | 11 +++++--
>  arch/x86/kernel/paravirt.c            |  2 +-
>  arch/x86/mm/tlb.c                     | 47 ++++++++++++++++++---------
>  arch/x86/xen/mmu_pv.c                 | 11 +++----
>  include/trace/events/xen.h            |  2 +-
>  10 files changed, 62 insertions(+), 41 deletions(-)
> 

For the Hyper-V parts --

Reviewed-by: Michael Kelley <mikelley at microsoft.com>