thr3ads.net - Linux Virtualization - [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path [Nov 2017]

Quan Xu

2017-Nov-16 09:12 UTC

[PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path

On 2017-11-16 06:03, Thomas Gleixner wrote:
> On Wed, 15 Nov 2017, Peter Zijlstra wrote:
>
>> On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote:
>>> From: Yang Zhang <yang.zhang.wz at gmail.com>
>>>
>>> Implement a generic idle poll which resembles the functionality
>>> found in arch/. Provide weak arch_cpu_idle_poll function which
>>> can be overridden by the architecture code if needed.
>> No, we want less of those magic hooks, not more.
>>
>>> Interrupts arrive which may not cause a reschedule in idle loops.
>>> In KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry
>>> for interrupts and VM-exit immediately. Also this becomes more
>>> expensive than bare metal. Add a generic idle poll before enter
>>> real idle path. When a reschedule event is pending, we can bypass
>>> the real idle path.
>> Why not do a HV specific idle driver?
> If I understand the problem correctly then he wants to avoid the heavy
> lifting in tick_nohz_idle_enter() in the first place, but there is already
> an interesting quirk there which makes it exit early.  See commit
> 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit
> looks similar. But lets not proliferate that. I'd rather see that go away.
Agreed.

Even if we can get more benefit than commit 3c5d92a0cfb5 ("nohz: Introduce
arch_needs_cpu") in a KVM guest, I won't proliferate that.

> But the irq_timings stuff is heading into the same direction, with a more
> complex prediction logic which should tell you pretty good how long that
> idle period is going to be and in case of an interrupt heavy workload this
> would skip the extra work of stopping and restarting the tick and provide a
> very good input into a polling decision.

Interesting. I have tested the IRQ_TIMINGS related code, which does not seem
to be working so far.
Also, I'd like to help as much as I can.

> This can be handled either in a HV specific idle driver or even in the
> generic core code. If the interrupt does not arrive within the predicted
> time then you can assume that the flood stopped and invoke halt or
> whatever.
>
> That avoids all of that 'tunable and tweakable' x86 specific hackery and
> utilizes common functionality which is mostly there already.

Here is some sample code. Poll for a while before entering halt in
cpuidle_enter_state(). If I get a reschedule event, then don't try to enter
halt. (I hope this is the right direction, as Peter mentioned in another
email.)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
                 target_state = &drv->states[index];
         }

+#ifdef CONFIG_PARAVIRT
+        paravirt_idle_poll();
+
+        if (need_resched())
+                return -EBUSY;
+#endif
+
         /* Take note of the planned idle state. */
         sched_idle_set_state(target_state);

thanks,

Quan
Alibaba Cloud

Quan Xu

2017-Nov-17 11:23 UTC

[PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path

On 2017-11-16 17:53, Thomas Gleixner wrote:
> On Thu, 16 Nov 2017, Quan Xu wrote:
>> On 2017-11-16 06:03, Thomas Gleixner wrote:
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>>                  target_state = &drv->states[index];
>>          }
>>
>> +#ifdef CONFIG_PARAVIRT
>> +        paravirt_idle_poll();
>> +
>> +        if (need_resched())
>> +                return -EBUSY;
>> +#endif
> That's just plain wrong. We don't want to see any of this PARAVIRT crap in
> anything outside the architecture/hypervisor interfacing code which really
> needs it.
>
> The problem can and must be solved at the generic level in the first place
> to gather the data which can be used to make such decisions.
>
> How that information is used might be either completely generic or requires
> system specific variants. But as long as we don't have any information at
> all we cannot discuss that.
>
> Please sit down and write up which data needs to be considered to make
> decisions about probabilistic polling. Then we need to compare and contrast
> that with the data which is necessary to make power/idle state decisions.
>
> I would be very surprised if this data would not overlap by at least 90%.
>
Peter, tglx,
Thanks for your comments.

Rethinking this patch set:

1. Which data needs to be considered to make decisions about probabilistic
polling

I really need to write up which data needs to be considered to make
decisions about probabilistic polling. For the last several months,
I have focused on the data _from idle to reschedule_, in order to bypass
the idle loops. Unfortunately, this inevitably makes me touch
scheduler/idle/nohz code.

Following tglx's suggestion, the data which is necessary to make power/idle
state decisions is the last idle state's residency time. IIUC this data
is the duration from idle to wakeup, which may be caused by a reschedule irq
or another irq.

I also tested that the reschedule irq overlaps by more than 90% (tracing the
need_resched status after cpuidle_idle_call) when I run ctxsw/netperf for
one minute.

Given the overlap, I think I can use the last idle state's residency time as
input to make decisions about probabilistic polling, as @dev->last_residency
does. It is much easier to get this data.
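
To make the idea of feeding the last residency time into the polling decision concrete, here is a minimal user-space sketch. It is not code from the patch set: the function name next_poll_ns(), the POLL_NS_MAX cap, and the grow/halve rule are all assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tunable, not taken from the patch set. */
#define POLL_NS_MAX 200000u   /* never busy-wait longer than 200 us */

/*
 * Pick the poll window to use before the next halt, given the window used
 * last time and how long the CPU actually stayed idle (its residency).
 *
 * - Residency above the cap: the idle period was long, polling would only
 *   burn cycles, so halve the window.
 * - Residency above the window but within the cap: a slightly longer poll
 *   would have avoided the halt, so grow the window to that residency.
 * - Residency within the window: the poll already covered it, keep it.
 */
static uint32_t next_poll_ns(uint32_t poll_ns, uint64_t last_residency_ns)
{
	assert(poll_ns <= POLL_NS_MAX);

	if (last_residency_ns > POLL_NS_MAX)
		return poll_ns / 2;
	if (last_residency_ns > poll_ns)
		return (uint32_t)last_residency_ns;   /* fits: <= POLL_NS_MAX */
	return poll_ns;
}
```

With this rule, a burst of short idle periods pulls the window up toward the observed residency, while a single long idle period starts shrinking it again. A real policy would likely need smoothing over several samples rather than reacting to one.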

2. Do a HV specific idle driver (function)

So far, power management is not exposed to the guest. Idle is simple for a
KVM guest: calling "sti" / "hlt" (cpuidle_idle_call() -->
default_idle_call()).
Thanks to the Xen guys, who implemented the paravirt framework, I can
implement it as easily as the following:

--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
         update_intr_gate(X86_TRAP_PF, async_page_fault);
 }

+static __cpuidle void kvm_safe_halt(void)
+{
+        /* 1. POLL, if need_resched() --> return */
+
+        asm volatile("sti; hlt": : :"memory"); /* 2. halt */
+
+        /* 3. get the last idle state's residency time */
+
+        /* 4. update poll duration based on last idle state's residency time */
+}
+
 void __init kvm_guest_init(void)
 {
         int i;
@@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
         if (kvmclock_vsyscall)
                 kvm_setup_vsyscall_timeinfo();

+        pv_irq_ops.safe_halt = kvm_safe_halt;
+
 #ifdef CONFIG_SMP

Then there is no need for me to introduce a new pvops, or to modify
schedule/idle/nohz code again.
I can also confine all of the code to arch/x86/kernel/kvm.c.

If this is in the right direction, I will send a new patch set next week.
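
The four numbered steps in the kvm_safe_halt() skeleton can be tried out in user space with a rough sketch like the one below. Everything here is hypothetical: struct idle_hooks, struct idle_state, safe_halt_sketch() and the fake_* helpers are illustration-only stand-ins for the kernel clock, need_resched() and "sti; hlt", and step 4 is reduced to recording the residency for a separate policy to consume.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hooks standing in for kernel facilities. */
struct idle_hooks {
	uint64_t (*now_ns)(void);        /* monotonic clock */
	bool     (*need_resched)(void);  /* is a reschedule pending? */
	void     (*halt)(void);          /* stand-in for "sti; hlt" */
};

struct idle_state {
	uint32_t poll_ns;            /* current poll window */
	uint64_t last_residency_ns;  /* idle-to-wakeup time of the last halt */
};

/* Returns true if the halt was skipped because work arrived while polling. */
static bool safe_halt_sketch(struct idle_state *st, const struct idle_hooks *h)
{
	uint64_t start;

	assert(st && h);
	start = h->now_ns();

	/* 1. Poll: spin until the window expires or a reschedule is pending. */
	while (h->now_ns() - start < st->poll_ns) {
		if (h->need_resched())
			return true;
	}

	/* 2. Halt until the next interrupt wakes the CPU. */
	h->halt();

	/* 3. Residency: time from entering idle to wakeup. */
	st->last_residency_ns = h->now_ns() - start;

	/* 4. A policy would now adjust st->poll_ns from last_residency_ns. */
	return false;
}

/* Deterministic fakes so the flow can be exercised without a real CPU. */
static uint64_t fake_time_ns;
static uint64_t fake_resched_at_ns = UINT64_MAX;
static uint64_t fake_irq_at_ns;

static uint64_t fake_now(void)          /* every clock read costs 100 ns */
{
	return fake_time_ns += 100;
}

static bool fake_need_resched(void)
{
	return fake_time_ns >= fake_resched_at_ns;
}

static void fake_halt(void)             /* jump ahead to the next interrupt */
{
	if (fake_time_ns < fake_irq_at_ns)
		fake_time_ns = fake_irq_at_ns;
}
```

Keeping the clock, the reschedule check and the halt behind function pointers is only a convenience for testing the control flow here; in the kernel these would be direct calls.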

thanks,

Quan
Alibaba Cloud
