From: Yang Zhang <yang.zhang.wz at gmail.com>

Some latency-intensive workloads show an obvious performance drop when
running inside a VM. The main reason is that the overhead is amplified
when running inside a VM. The largest cost I have seen is in the idle
path.

This patch introduces a new mechanism to poll for a while before
entering the idle state. If a reschedule is needed during the poll, then
we don't need to go through the heavy overhead path.

Here is the data we get when running benchmark contextswitch to measure
the latency (lower is better):

   1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
      3402.9 ns/ctxsw -- 199.8 %CPU

   2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
      halt_poll_threshold=10000  -- 1151.4 ns/ctxsw -- 200.1 %CPU
      halt_poll_threshold=20000  -- 1149.7 ns/ctxsw -- 199.9 %CPU
      halt_poll_threshold=30000  -- 1151.0 ns/ctxsw -- 199.9 %CPU
      halt_poll_threshold=40000  -- 1155.4 ns/ctxsw -- 199.3 %CPU
      halt_poll_threshold=50000  -- 1161.0 ns/ctxsw -- 200.0 %CPU
      halt_poll_threshold=100000 -- 1163.8 ns/ctxsw -- 200.4 %CPU
      halt_poll_threshold=300000 -- 1159.4 ns/ctxsw -- 201.9 %CPU
      halt_poll_threshold=500000 -- 1163.5 ns/ctxsw -- 205.5 %CPU

   3. w/ kvm dynamic poll:
      halt_poll_ns=10000  -- 3470.5 ns/ctxsw -- 199.6 %CPU
      halt_poll_ns=20000  -- 3273.0 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=30000  -- 3628.7 ns/ctxsw -- 199.4 %CPU
      halt_poll_ns=40000  -- 2280.6 ns/ctxsw -- 199.5 %CPU
      halt_poll_ns=50000  -- 3200.3 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=100000 -- 2186.6 ns/ctxsw -- 199.6 %CPU
      halt_poll_ns=300000 -- 3178.7 ns/ctxsw -- 199.6 %CPU
      halt_poll_ns=500000 -- 3505.4 ns/ctxsw -- 199.7 %CPU

   4. w/ patch and w/ kvm dynamic poll:
      halt_poll_ns=10000 & halt_poll_threshold=10000 -- 1155.5 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=10000 & halt_poll_threshold=20000 -- 1165.6 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=10000 & halt_poll_threshold=30000 -- 1161.1 ns/ctxsw -- 200.0 %CPU
      halt_poll_ns=20000 & halt_poll_threshold=10000 -- 1158.1 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=20000 & halt_poll_threshold=20000 -- 1161.0 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=20000 & halt_poll_threshold=30000 -- 1163.7 ns/ctxsw -- 199.9 %CPU
      halt_poll_ns=30000 & halt_poll_threshold=10000 -- 1158.7 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=30000 & halt_poll_threshold=20000 -- 1153.8 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=30000 & halt_poll_threshold=30000 -- 1155.1 ns/ctxsw -- 199.8 %CPU

   5. idle=poll
      3957.57 ns/ctxsw -- 999.4 %CPU

Here is the data we get when running benchmark netperf:

   1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
      29031.6 bit/s -- 76.1 %CPU

   2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
      halt_poll_threshold=10000  -- 29021.7 bit/s -- 105.1 %CPU
      halt_poll_threshold=20000  -- 33463.5 bit/s -- 128.2 %CPU
      halt_poll_threshold=30000  -- 34436.4 bit/s -- 127.8 %CPU
      halt_poll_threshold=40000  -- 35563.3 bit/s -- 129.6 %CPU
      halt_poll_threshold=50000  -- 35787.7 bit/s -- 129.4 %CPU
      halt_poll_threshold=100000 -- 35477.7 bit/s -- 130.0 %CPU
      halt_poll_threshold=300000 -- 35730.0 bit/s -- 132.4 %CPU
      halt_poll_threshold=500000 -- 34978.4 bit/s -- 134.2 %CPU

   3. w/ kvm dynamic poll:
      halt_poll_ns=10000  -- 28849.8 bit/s -- 75.2 %CPU
      halt_poll_ns=20000  -- 29004.8 bit/s -- 76.1 %CPU
      halt_poll_ns=30000  -- 35662.0 bit/s -- 199.7 %CPU
      halt_poll_ns=40000  -- 35874.8 bit/s -- 187.5 %CPU
      halt_poll_ns=50000  -- 35603.1 bit/s -- 199.8 %CPU
      halt_poll_ns=100000 -- 35588.8 bit/s -- 200.0 %CPU
      halt_poll_ns=300000 -- 35912.4 bit/s -- 200.0 %CPU
      halt_poll_ns=500000 -- 35735.6 bit/s -- 200.0 %CPU

   4. w/ patch and w/ kvm dynamic poll:
      halt_poll_ns=10000 & halt_poll_threshold=10000 -- 29427.9 bit/s -- 107.8 %CPU
      halt_poll_ns=10000 & halt_poll_threshold=20000 -- 33048.4 bit/s -- 128.1 %CPU
      halt_poll_ns=10000 & halt_poll_threshold=30000 -- 35129.8 bit/s -- 129.1 %CPU
      halt_poll_ns=20000 & halt_poll_threshold=10000 -- 31091.3 bit/s -- 130.3 %CPU
      halt_poll_ns=20000 & halt_poll_threshold=20000 -- 33587.9 bit/s -- 128.9 %CPU
      halt_poll_ns=20000 & halt_poll_threshold=30000 -- 35532.9 bit/s -- 129.1 %CPU
      halt_poll_ns=30000 & halt_poll_threshold=10000 -- 35633.1 bit/s -- 199.4 %CPU
      halt_poll_ns=30000 & halt_poll_threshold=20000 -- 42225.3 bit/s -- 198.7 %CPU
      halt_poll_ns=30000 & halt_poll_threshold=30000 -- 42210.7 bit/s -- 200.3 %CPU

   5. idle=poll
      37081.7 bit/s -- 998.1 %CPU

---
V2 -> V3:
- move poll update into arch/. In v3, the poll update is based on the
  duration of the last idle loop, which runs from tick_nohz_idle_enter
  to tick_nohz_idle_exit, and we try our best not to interfere with
  scheduler/idle code. (This seems not to follow Peter's v2 comment,
  however we had a f2f discussion about it in Prague.)
- enhance patch description.
- enhance Documentation and sysctls.
- test with IRQ_TIMINGS related code, which does not seem to work so far.

V1 -> V2:
- integrate the smart halt poll into paravirt code
- use idle_stamp instead of check_poll
- since it is hard to tell whether the vcpu is the only task on the
  pcpu, we don't consider it in this series. (May improve it in future)

---
Quan Xu (4):
  x86/paravirt: Add pv_idle_ops to paravirt ops
  KVM guest: register kvm_idle_poll for pv_idle_ops
  Documentation: Add three sysctls for smart idle poll
  tick: get duration of the last idle loop

Yang Zhang (2):
  sched/idle: Add a generic poll before enter real idle path
  KVM guest: introduce smart idle poll algorithm

 Documentation/sysctl/kernel.txt       | 35 ++++++++++++++++
 arch/x86/include/asm/paravirt.h       |  5 ++
 arch/x86/include/asm/paravirt_types.h |  6 +++
 arch/x86/kernel/kvm.c                 | 73 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/paravirt.c            | 10 +++++
 arch/x86/kernel/process.c             |  7 +++
 include/linux/kernel.h                |  6 +++
 include/linux/tick.h                  |  2 +
 kernel/sched/idle.c                   |  2 +
 kernel/sysctl.c                       | 34 +++++++++++++++
 kernel/time/tick-sched.c              | 11 +++++
 kernel/time/tick-sched.h              |  3 +
 12 files changed, 194 insertions(+), 0 deletions(-)
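To put two of the figures quoted in the cover letter in relative terms, the arithmetic can be sketched as follows. This is only an illustration: the helper name `percent_change` and the choice of which rows to compare are assumptions made here, not part of the series.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from baseline to value, in percent (hypothetical helper). */
static double percent_change(double baseline, double value)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}

/* Print the relative change for two rows taken from the cover letter. */
static void report(void)
{
	/* contextswitch: no patch (3402.9) vs. halt_poll_threshold=10000 (1151.4) */
	printf("ctxsw latency:      %+.1f%%\n", percent_change(3402.9, 1151.4));
	/* netperf: no patch (29031.6) vs. halt_poll_threshold=50000 (35787.7) */
	printf("netperf throughput: %+.1f%%\n", percent_change(29031.6, 35787.7));
}
```

For these two rows the latency falls by roughly two thirds and the netperf throughput rises by a bit under a quarter, while the reported netperf CPU usage goes from 76.1 %CPU to 129.4 %CPU.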
Quan Xu
2017-Nov-13 10:06 UTC
[PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
From: Quan Xu <quan.xu0 at gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called
in the idle path, which will poll for a while before we enter the real
idle state.

In virtualization, the idle path includes several heavy operations,
including timer access (LAPIC timer or TSC deadline timer), which hurt
performance, especially for latency-intensive workloads like
message-passing tasks. The cost is mainly from the vmexit, which is a
hardware context switch between the virtual machine and the hypervisor.
Our solution is to poll for a while and not enter the real idle path if
we get a schedule event during polling.

Polling may waste CPU, so we adopt a smart polling mechanism to
reduce useless polling.

Signed-off-by: Yang Zhang <yang.zhang.wz at gmail.com>
Signed-off-by: Quan Xu <quan.xu0 at gmail.com>
Cc: Juergen Gross <jgross at suse.com>
Cc: Alok Kataria <akataria at vmware.com>
Cc: Rusty Russell <rusty at rustcorp.com.au>
Cc: Thomas Gleixner <tglx at linutronix.de>
Cc: Ingo Molnar <mingo at redhat.com>
Cc: "H. Peter Anvin" <hpa at zytor.com>
Cc: x86 at kernel.org
Cc: virtualization at lists.linux-foundation.org
Cc: linux-kernel at vger.kernel.org
Cc: xen-devel at lists.xenproject.org
---
 arch/x86/include/asm/paravirt.h       |    5 +++++
 arch/x86/include/asm/paravirt_types.h |    6 ++++++
 arch/x86/kernel/paravirt.c            |    6 ++++++
 3 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index fd81228..3c83727 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -198,6 +198,11 @@ static inline unsigned long long paravirt_read_pmc(int counter)
 
 #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
 
+static inline void paravirt_idle_poll(void)
+{
+	PVOP_VCALL0(pv_idle_ops.poll);
+}
+
 static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
 {
 	PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 10cc3b9..95c0e3e 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -313,6 +313,10 @@ struct pv_lock_ops {
 	struct paravirt_callee_save vcpu_is_preempted;
 } __no_randomize_layout;
 
+struct pv_idle_ops {
+	void (*poll)(void);
+} __no_randomize_layout;
+
 /* This contains all the paravirt structures: we get a convenient
  * number for each function using the offset which we use to indicate
  * what to patch. */
@@ -323,6 +327,7 @@ struct paravirt_patch_template {
 	struct pv_irq_ops pv_irq_ops;
 	struct pv_mmu_ops pv_mmu_ops;
 	struct pv_lock_ops pv_lock_ops;
+	struct pv_idle_ops pv_idle_ops;
 } __no_randomize_layout;
 
 extern struct pv_info pv_info;
@@ -332,6 +337,7 @@ struct paravirt_patch_template {
 extern struct pv_irq_ops pv_irq_ops;
 extern struct pv_mmu_ops pv_mmu_ops;
 extern struct pv_lock_ops pv_lock_ops;
+extern struct pv_idle_ops pv_idle_ops;
 
 #define PARAVIRT_PATCH(x) \
 	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 19a3e8f..67cab22 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -128,6 +128,7 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 		.pv_lock_ops = pv_lock_ops,
 #endif
+		.pv_idle_ops = pv_idle_ops,
 	};
 	return *((void **)&tmpl + type);
 }
@@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
 	.steal_clock = native_steal_clock,
 };
 
+struct pv_idle_ops pv_idle_ops = {
+	.poll = paravirt_nop,
+};
+
 __visible struct pv_irq_ops pv_irq_ops = {
 	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
 	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
@@ -463,3 +468,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
 EXPORT_SYMBOL (pv_mmu_ops);
 EXPORT_SYMBOL_GPL(pv_info);
 EXPORT_SYMBOL (pv_irq_ops);
+EXPORT_SYMBOL (pv_idle_ops);
-- 
1.7.1
Quan Xu
2017-Nov-13 10:06 UTC
[PATCH RFC v3 2/6] KVM guest: register kvm_idle_poll for pv_idle_ops
From: Quan Xu <quan.xu0 at gmail.com>

Although smart idle poll has nothing to do with paravirt, it cannot
bring any benefit to native. So we only enable it when Linux runs as a
KVM guest (it can also be extended to other hypervisors like Xen,
Hyper-V and VMware).

Introduce per-CPU variable poll_duration_ns to control the max poll
time.

Signed-off-by: Yang Zhang <yang.zhang.wz at gmail.com>
Signed-off-by: Quan Xu <quan.xu0 at gmail.com>
Cc: Paolo Bonzini <pbonzini at redhat.com>
Cc: "Radim Krčmář" <rkrcmar at redhat.com>
Cc: Thomas Gleixner <tglx at linutronix.de>
Cc: Ingo Molnar <mingo at redhat.com>
Cc: "H. Peter Anvin" <hpa at zytor.com>
Cc: x86 at kernel.org
Cc: kvm at vger.kernel.org
Cc: linux-kernel at vger.kernel.org
---
 arch/x86/kernel/kvm.c |   26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8bb9594..2a6e402 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -75,6 +75,7 @@ static int parse_no_kvmclock_vsyscall(char *arg)
 
 early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
 
+static DEFINE_PER_CPU(unsigned long, poll_duration_ns);
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
@@ -364,6 +365,29 @@ static void kvm_guest_cpu_init(void)
 		kvm_register_steal_time();
 }
 
+static void kvm_idle_poll(void)
+{
+	unsigned long poll_duration = this_cpu_read(poll_duration_ns);
+	ktime_t start, cur, stop;
+
+	start = cur = ktime_get();
+	stop = ktime_add_ns(ktime_get(), poll_duration);
+
+	do {
+		if (need_resched())
+			break;
+		cur = ktime_get();
+	} while (ktime_before(cur, stop));
+}
+
+static void kvm_guest_idle_init(void)
+{
+	if (!kvm_para_available())
+		return;
+
+	pv_idle_ops.poll = kvm_idle_poll;
+}
+
 static void kvm_pv_disable_apf(void)
 {
 	if (!__this_cpu_read(apf_reason.enabled))
@@ -499,6 +523,8 @@ void __init kvm_guest_init(void)
 	kvm_guest_cpu_init();
 #endif
 
+	kvm_guest_idle_init();
+
 	/*
 	 * Hard lockup detection is enabled by default. Disable it, as guests
 	 * can get false positives too easily, for example if the host is
-- 
1.7.1
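The bounded poll in kvm_idle_poll() can be tried outside the kernel with a small user-space analogue. The sketch below is not the patch's code: `now_ns`, `poll_before_idle` and the `resched_pending` flag are names invented here, `clock_gettime(CLOCK_MONOTONIC)` stands in for ktime_get(), and an atomic flag stands in for need_resched().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds (user-space stand-in for ktime_get()). */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Spin until *resched_pending becomes true or budget_ns has elapsed.
 * Returns true if work arrived during the poll, meaning the caller
 * could skip the expensive idle path.
 */
static bool poll_before_idle(atomic_bool *resched_pending, int64_t budget_ns)
{
	int64_t stop;

	assert(budget_ns >= 0);
	stop = now_ns() + budget_ns;

	do {
		if (atomic_load(resched_pending))
			return true;
	} while (now_ns() < stop);

	return false;
}
```

The sketch keeps only the deadline; the patch also assigns a `start` variable, which the posted code does not read afterwards. Like the patch, it checks the flag at least once even with a zero budget.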
Quan Xu
2017-Nov-13 10:06 UTC
[PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
From: Yang Zhang <yang.zhang.wz at gmail.com>

Implement a generic idle poll which resembles the functionality
found in arch/. Provide a weak arch_cpu_idle_poll function which
can be overridden by the architecture code if needed.

In idle loops, interrupts may arrive that do not cause a reschedule.
In a KVM guest, this costs several VM-exit/VM-entry cycles: a VM-entry
for the interrupt and a VM-exit immediately afterwards. This is also
more expensive than on bare metal. Add a generic idle poll before
entering the real idle path. When a reschedule event is pending, we can
bypass the real idle path.

Signed-off-by: Quan Xu <quan.xu0 at gmail.com>
Signed-off-by: Yang Zhang <yang.zhang.wz at gmail.com>
Cc: Thomas Gleixner <tglx at linutronix.de>
Cc: Ingo Molnar <mingo at redhat.com>
Cc: "H. Peter Anvin" <hpa at zytor.com>
Cc: x86 at kernel.org
Cc: Peter Zijlstra <peterz at infradead.org>
Cc: Borislav Petkov <bp at alien8.de>
Cc: Kyle Huey <me at kylehuey.com>
Cc: Len Brown <len.brown at intel.com>
Cc: Andy Lutomirski <luto at kernel.org>
Cc: Tom Lendacky <thomas.lendacky at amd.com>
Cc: Tobias Klauser <tklauser at distanz.ch>
Cc: linux-kernel at vger.kernel.org
---
 arch/x86/kernel/process.c |    7 +++++++
 kernel/sched/idle.c       |    2 ++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c676853..f7db8b5 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -333,6 +333,13 @@ void arch_cpu_idle(void)
 	x86_idle();
 }
 
+#ifdef CONFIG_PARAVIRT
+void arch_cpu_idle_poll(void)
+{
+	paravirt_idle_poll();
+}
+#endif
+
 /*
  * We use this if we don't have any better idle routine..
  */
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 257f4f0..df7c422 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }
 void __weak arch_cpu_idle_exit(void) { }
@@ -219,6 +220,7 @@ static void do_idle(void)
 	 */
 
 	__current_set_polling();
+	arch_cpu_idle_poll();
 	quiet_vmstat();
 	tick_nohz_idle_enter();
 
-- 
1.7.1
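The control flow this patch aims for (poll first, and take the real idle path only if nothing became runnable) can be modelled with a toy simulation. Everything below is hypothetical and exists only for illustration: `struct sim_cpu`, `sim_idle_poll` and `sim_do_idle` are not kernel symbols, and a counter stands in for the cost of halting.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated per-CPU state; illustrative only. */
struct sim_cpu {
	bool need_resched;     /* a task is waiting to run */
	int  polls_left;       /* poll iterations allowed before giving up */
	int  wake_after_polls; /* work arrives after this many polls; -1 = never */
	int  halts;            /* how often the expensive idle path was taken */
};

/* Stand-in for arch_cpu_idle_poll(): spin for a bounded number of iterations. */
static void sim_idle_poll(struct sim_cpu *cpu)
{
	for (int i = 0; i < cpu->polls_left && !cpu->need_resched; i++) {
		if (cpu->wake_after_polls >= 0 && i >= cpu->wake_after_polls)
			cpu->need_resched = true;
	}
}

/* Stand-in for do_idle(): poll first, then halt only if nothing is pending. */
static void sim_do_idle(struct sim_cpu *cpu)
{
	assert(cpu->polls_left >= 0);
	sim_idle_poll(cpu);

	while (!cpu->need_resched) {
		cpu->halts++;             /* models HLT and the resulting VM exit */
		cpu->need_resched = true; /* models the wakeup that ends the halt */
	}
}
```

When work shows up within the poll budget, `halts` stays at zero; when it does not, the simulated CPU still halts once, having also spent the poll budget. That is the trade-off the %CPU columns in the cover letter reflect.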
Juergen Gross
2017-Nov-13 10:53 UTC
[PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 13/11/17 11:06, Quan Xu wrote:
> From: Quan Xu <quan.xu0 at gmail.com>
>
> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
> in idle path which will poll for a while before we enter the real idle
> state.
>
> In virtualization, idle path includes several heavy operations
> includes timer access(LAPIC timer or TSC deadline timer) which will
> hurt performance especially for latency intensive workload like message
> passing task. The cost is mainly from the vmexit which is a hardware
> context switch between virtual machine and hypervisor. Our solution is
> to poll for a while and do not enter real idle path if we can get the
> schedule event during polling.
>
> Poll may cause the CPU waste so we adopt a smart polling mechanism to
> reduce the useless poll.
>
> Signed-off-by: Yang Zhang <yang.zhang.wz at gmail.com>
> Signed-off-by: Quan Xu <quan.xu0 at gmail.com>
> Cc: Juergen Gross <jgross at suse.com>
> Cc: Alok Kataria <akataria at vmware.com>
> Cc: Rusty Russell <rusty at rustcorp.com.au>
> Cc: Thomas Gleixner <tglx at linutronix.de>
> Cc: Ingo Molnar <mingo at redhat.com>
> Cc: "H. Peter Anvin" <hpa at zytor.com>
> Cc: x86 at kernel.org
> Cc: virtualization at lists.linux-foundation.org
> Cc: linux-kernel at vger.kernel.org
> Cc: xen-devel at lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a
new pvops function is necessary? Wouldn't a function pointer, maybe
guarded by a static key, be enough? A further advantage would be that
this would work on other architectures, too.

Juergen
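The alternative Juergen describes, a plain function pointer guarded by a cheap enable check, might look roughly like the user-space sketch below. It is an assumption-laden illustration and not code from this thread: a `bool` plays the role that a static key (DEFINE_STATIC_KEY_FALSE with static_branch_unlikely() in the kernel) would play, and `register_idle_poll`, `maybe_idle_poll` and `demo_poll` are invented names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*idle_poll_fn)(void);

/* The bool stands in for a static key; both variables are illustrative. */
static bool idle_poll_enabled;
static idle_poll_fn idle_poll_hook;

/* Would be called once from a hypervisor-specific init path (hypothetical). */
static void register_idle_poll(idle_poll_fn fn)
{
	assert(fn != NULL);
	idle_poll_hook = fn;
	idle_poll_enabled = true;
}

/* Called from the idle loop: a single branch when no hook is registered. */
static void maybe_idle_poll(void)
{
	if (idle_poll_enabled)
		idle_poll_hook();
}

/* Demo hook that only counts how often it was invoked. */
static int demo_poll_calls;

static void demo_poll(void)
{
	demo_poll_calls++;
}
```

Nothing in this arrangement is x86-specific, which matches the portability point in the reply: any architecture or hypervisor guest code could register a hook.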
Peter Zijlstra
2017-Nov-15 12:11 UTC
[PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote:
> From: Yang Zhang <yang.zhang.wz at gmail.com>
>
> Implement a generic idle poll which resembles the functionality
> found in arch/. Provide weak arch_cpu_idle_poll function which
> can be overridden by the architecture code if needed.

No, we want less of those magic hooks, not more.

> Interrupts arrive which may not cause a reschedule in idle loops.
> In KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry
> for interrupts and VM-exit immediately. Also this becomes more
> expensive than bare metal. Add a generic idle poll before enter
> real idle path. When a reschedule event is pending, we can bypass
> the real idle path.

Why not do a HV specific idle driver?
Konrad Rzeszutek Wilk
2017-Nov-15 21:31 UTC
[Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support
On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote:
> From: Yang Zhang <yang.zhang.wz at gmail.com>
>
> Some latency-intensive workload have seen obviously performance
> drop when running inside VM. The main reason is that the overhead
> is amplified when running inside VM. The most cost I have seen is
> inside idle path.

Meaning an VMEXIT b/c it is an 'halt' operation ? And then going
back in guest (VMRESUME) takes time. And hence your latency gets
all whacked b/c of this?

So if I understand - you want to use your _full_ timeslice (of the guest)
without ever (or as much as possible) to go in the hypervisor?

Which means in effect you don't care about power-saving or CPUfreq
savings, you just want to eat the full CPU for snack?

>
> This patch introduces a new mechanism to poll for a while before
> entering idle state. If schedule is needed during poll, then we
> don't need to goes through the heavy overhead path.

Schedule of what? The guest or the host?
Quan Xu
2017-Nov-20 07:18 UTC
[Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support
On 2017-11-16 05:31, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote:
>> From: Yang Zhang <yang.zhang.wz at gmail.com>
>>
>> Some latency-intensive workload have seen obviously performance
>> drop when running inside VM. The main reason is that the overhead
>> is amplified when running inside VM. The most cost I have seen is
>> inside idle path.
> Meaning an VMEXIT b/c it is an 'halt' operation ? And then going
> back in guest (VMRESUME) takes time. And hence your latency gets
> all whacked b/c of this?

   Konrad, I can't follow 'b/c' here.. sorry.

> So if I understand - you want to use your _full_ timeslice (of the guest)
> without ever (or as much as possible) to go in the hypervisor?

   As much as possible.

> Which means in effect you don't care about power-saving or CPUfreq
> savings, you just want to eat the full CPU for snack?

   Actually, we care about power-saving. The poll duration is
   self-tuning; otherwise it would be almost the same as 'halt=poll'.
   Also, we always report the CPU usage of the netperf/ctxsw
   benchmarks. We got much more performance with a limited increase
   in CPU usage.

>> This patch introduces a new mechanism to poll for a while before
>> entering idle state. If schedule is needed during poll, then we
>> don't need to goes through the heavy overhead path.
> Schedule of what? The guest or the host?

   A reschedule by the guest scheduler..
   it is the guest.

Quan
Alibaba Cloud