Quan Xu
2017-Nov-16 09:12 UTC
[PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 06:03, Thomas Gleixner wrote:> On Wed, 15 Nov 2017, Peter Zijlstra wrote: > >> On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote: >>> From: Yang Zhang <yang.zhang.wz at gmail.com> >>> >>> Implement a generic idle poll which resembles the functionality >>> found in arch/. Provide weak arch_cpu_idle_poll function which >>> can be overridden by the architecture code if needed. >> No, we want less of those magic hooks, not more. >> >>> Interrupts arrive which may not cause a reschedule in idle loops. >>> In KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry >>> for interrupts and VM-exit immediately. Also this becomes more >>> expensive than bare metal. Add a generic idle poll before enter >>> real idle path. When a reschedule event is pending, we can bypass >>> the real idle path. >> Why not do a HV specific idle driver? > If I understand the problem correctly then he wants to avoid the heavy > lifting in tick_nohz_idle_enter() in the first place, but there is already > an interesting quirk there which makes it exit early. See commit > 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit > looks similar. But lets not proliferate that. I'd rather see that go away.agreed. Even we can get more benifit than commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu") in kvm guest. I won't proliferate that..> But the irq_timings stuff is heading into the same direction, with a more > complex prediction logic which should tell you pretty good how long that > idle period is going to be and in case of an interrupt heavy workload this > would skip the extra work of stopping and restarting the tick and provide a > very good input into a polling decision.interesting. I have tested with IRQ_TIMINGS related code, which seems not working so far. Also I'd like to help as much as I can.> This can be handled either in a HV specific idle driver or even in the > generic core code. If the interrupt does not arrive then you can assume > within the predicted time then you can assume that the flood stopped and > invoke halt or whatever. > > That avoids all of that 'tunable and tweakable' x86 specific hackery and > utilizes common functionality which is mostly there already.here is some sample code. Poll for a while before enter halt in cpuidle_enter_state() If I get a reschedule event, then don't try to enter halt.? (I hope this is the right direction as Peter mentioned in another email) --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, ??????????????? target_state = &drv->states[index]; ??????? } +#ifdef CONFIG_PARAVIRT +?????? paravirt_idle_poll(); + +?????? if (need_resched()) +?????????????? return -EBUSY; +#endif + ??????? /* Take note of the planned idle state. */ ??????? sched_idle_set_state(target_state); thanks, Quan Alibaba Cloud
Quan Xu
2017-Nov-17 11:23 UTC
[PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 17:53, Thomas Gleixner wrote:> On Thu, 16 Nov 2017, Quan Xu wrote: >> On 2017-11-16 06:03, Thomas Gleixner wrote: >> --- a/drivers/cpuidle/cpuidle.c >> +++ b/drivers/cpuidle/cpuidle.c >> @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, >> struct cpuidle_driver *drv, >> ??????????????? target_state = &drv->states[index]; >> ??????? } >> >> +#ifdef CONFIG_PARAVIRT >> +?????? paravirt_idle_poll(); >> + >> +?????? if (need_resched()) >> +?????????????? return -EBUSY; >> +#endif > That's just plain wrong. We don't want to see any of this PARAVIRT crap in > anything outside the architecture/hypervisor interfacing code which really > needs it. > > The problem can and must be solved at the generic level in the first place > to gather the data which can be used to make such decisions. > > How that information is used might be either completely generic or requires > system specific variants. But as long as we don't have any information at > all we cannot discuss that. > > Please sit down and write up which data needs to be considered to make > decisions about probabilistic polling. Then we need to compare and contrast > that with the data which is necessary to make power/idle state decisions. > > I would be very surprised if this data would not overlap by at least 90%. >Peter, tglx Thanks for your comments.. rethink of this patch set, 1. which data needs to considerd to make decisions about probabilistic polling I really need to write up which data needs to considerd to make decisions about probabilistic polling. At last several months, I always focused on the data _from idle to reschedule_, then to bypass the idle loops. unfortunately, this makes me touch scheduler/idle/nohz code inevitably. with tglx's suggestion, the data which is necessary to make power/idle state decisions, is the last idle state's residency time. IIUC this data is duration from idle to wakeup, which maybe by reschedule irq or other irq. I also test that the reschedule irq overlap by more than 90% (trace the need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for one minute. as the overlap, I think I can input the last idle state's residency time to make decisions about probabilistic polling, as @dev->last_residency does. it is much easier to get data. 2. do a HV specific idle driver (function) so far, power management is not exposed to guest.. idle is simple for KVM guest, calling "sti" / "hlt"(cpuidle_idle_call() --> default_idle_call()).. thanks Xen guys, who has implemented the paravirt framework. I can implement it as easy as following: ???????????? --- a/arch/x86/kernel/kvm.c ???????????? +++ b/arch/x86/kernel/kvm.c ???????????? @@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void) ???????????????????? update_intr_gate(X86_TRAP_PF, async_page_fault); ????????????? } ???????????? +static __cpuidle void kvm_safe_halt(void) ???????????? +{ ??? ???? +??????? /* 1. POLL, if need_resched() --> return */ ??? ???? + ???????????? +??????? asm volatile("sti; hlt": : :"memory"); /* 2. halt */ ???????????? + ??? ???? +??????? /* 3. get the last idle state's residency time */ ???????????? + ??? ???? +??????? /* 4. update poll duration based on last idle state's residency time */ ???????????? +} ???????????? + ????????????? void __init kvm_guest_init(void) ????????????? { ???????????????????? int i; ???????????? @@ -490,6 +496,8 @@ void __init kvm_guest_init(void) ???????????????????? if (kvmclock_vsyscall) ???????????????????????????? kvm_setup_vsyscall_timeinfo(); ???????????? +?????? pv_irq_ops.safe_halt = kvm_safe_halt; ???????????? + ????????????? #ifdef CONFIG_SMP then, I am no need to introduce a new pvops, and never modify schedule/idle/nohz code again. also I can narrow all of the code down in arch/x86/kernel/kvm.c. If this is in the right direction, I will send a new patch set next week.. thanks, Quan Alibaba Cloud
Reasonably Related Threads
- [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
- [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
- [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
- [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
- [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path