thr3ads.net - Linux Virtualization - [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path [Nov 2017]

Quan Xu

2017-Nov-17 11:23 UTC

[PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path

On 2017-11-16 17:53, Thomas Gleixner wrote:
> On Thu, 16 Nov 2017, Quan Xu wrote:
>> On 2017-11-16 06:03, Thomas Gleixner wrote:
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
>>                 target_state = &drv->states[index];
>>         }
>>
>> +#ifdef CONFIG_PARAVIRT
>> +       paravirt_idle_poll();
>> +
>> +       if (need_resched())
>> +               return -EBUSY;
>> +#endif
> That's just plain wrong. We don't want to see any of this PARAVIRT crap in
> anything outside the architecture/hypervisor interfacing code which really
> needs it.
>
> The problem can and must be solved at the generic level in the first place
> to gather the data which can be used to make such decisions.
>
> How that information is used might be either completely generic or requires
> system specific variants. But as long as we don't have any information at
> all we cannot discuss that.
>
> Please sit down and write up which data needs to be considered to make
> decisions about probabilistic polling. Then we need to compare and contrast
> that with the data which is necessary to make power/idle state decisions.
>
> I would be very surprised if this data would not overlap by at least 90%.
>
Peter, tglx,

Thanks for your comments.

Rethinking this patch set:

1. Which data needs to be considered to make decisions about probabilistic polling

I do need to write up which data needs to be considered to make
decisions about probabilistic polling. Over the last several months
I focused only on the data _from idle to reschedule_, in order to bypass
the idle loops. Unfortunately, that inevitably made me touch the
scheduler/idle/nohz code.

Following tglx's suggestion, the data needed to make power/idle
state decisions is the last idle state's residency time. IIUC this is
the duration from idle entry to wakeup, where the wakeup may be caused by
a reschedule IRQ or by some other IRQ.

I also measured that reschedule IRQs overlap with these wakeups by more
than 90% (by tracing the need_resched status after cpuidle_idle_call) when
running ctxsw/netperf for one minute.

Given that overlap, I think I can use the last idle state's residency time
as the input for probabilistic polling decisions, as @dev->last_residency does.
That data is much easier to get.
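
As an illustration of the overlap measurement described above, here is a minimal sketch in plain C. It assumes one sample is recorded per idle exit; the `idle_exit_sample` structure and the `resched_wakeup_percent()` helper are hypothetical names invented for this example and are not part of the kernel or of the patch set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one idle exit; not a kernel structure. */
struct idle_exit_sample {
	unsigned long long residency_ns; /* time spent idle before the wakeup */
	bool need_resched;               /* was a reschedule pending on exit? */
};

/*
 * Return the percentage (0..100, rounded down) of idle exits that ended
 * with a reschedule pending. An empty sample set yields 0.
 */
static unsigned int resched_wakeup_percent(const struct idle_exit_sample *s,
					   size_t n)
{
	size_t i, hits = 0;

	if (n == 0)
		return 0;

	for (i = 0; i < n; i++)
		if (s[i].need_resched)
			hits++;

	return (unsigned int)(hits * 100 / n);
}
```

A high percentage would suggest that the plain residency time is a reasonable stand-in for the idle-to-reschedule time for that workload, which is the argument made above for ctxsw/netperf.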

2. Do an HV-specific idle driver (function)

So far, power management is not exposed to the guest. Idle is simple for a
KVM guest: it calls "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call()).
Thanks to the Xen guys, who implemented the paravirt framework, I can
implement it as easily as the following:

    --- a/arch/x86/kernel/kvm.c
    +++ b/arch/x86/kernel/kvm.c
    @@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
            update_intr_gate(X86_TRAP_PF, async_page_fault);
     }

    +static __cpuidle void kvm_safe_halt(void)
    +{
    +        /* 1. POLL, if need_resched() --> return */
    +
    +        asm volatile("sti; hlt": : :"memory"); /* 2. halt */
    +
    +        /* 3. get the last idle state's residency time */
    +
    +        /* 4. update poll duration based on last idle state's residency time */
    +}
    +
     void __init kvm_guest_init(void)
     {
            int i;
    @@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
            if (kvmclock_vsyscall)
                    kvm_setup_vsyscall_timeinfo();

    +       pv_irq_ops.safe_halt = kvm_safe_halt;
    +
     #ifdef CONFIG_SMP

With this, I don't need to introduce a new pvops, and I never have to modify
the scheduler/idle/nohz code again.
I can also confine all of the code to arch/x86/kernel/kvm.c.
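
Step 4 of the kvm_safe_halt() outline above (update the poll duration from the last idle residency) can be sketched in plain C as follows. This is only one possible policy under stated assumptions, not the patch's actual algorithm: the function name `next_poll_ns()`, the constants `POLL_NS_MIN` and `POLL_NS_MAX`, and the double/halve rule are all hypothetical choices made for illustration.

```c
#include <stdint.h>

/* Hypothetical tunables, chosen only for this example. */
#define POLL_NS_MIN  10000ULL   /* smallest non-zero poll window: 10 us */
#define POLL_NS_MAX  200000ULL  /* largest poll window: 200 us */

/*
 * Derive the next poll window from the current one and from the last
 * idle residency (step 3 of the outline supplies last_residency_ns).
 *
 * Assumed policy:
 *  - a short sleep (no longer than POLL_NS_MAX) means polling could have
 *    caught the wakeup, so double the window, starting from POLL_NS_MIN
 *    and capping at POLL_NS_MAX;
 *  - a long sleep means polling would have been wasted, so halve the
 *    window and turn polling off (0) once it drops below POLL_NS_MIN.
 */
static uint64_t next_poll_ns(uint64_t poll_ns, uint64_t last_residency_ns)
{
	if (last_residency_ns <= POLL_NS_MAX) {
		uint64_t grown = poll_ns ? poll_ns * 2 : POLL_NS_MIN;

		return grown > POLL_NS_MAX ? POLL_NS_MAX : grown;
	}

	poll_ns /= 2;
	return poll_ns < POLL_NS_MIN ? 0 : poll_ns;
}
```

In a real guest the result would feed the poll loop of step 1 on the next idle entry. The sketch keeps the policy as a pure function so it can be reasoned about separately from the halt path.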

If this is the right direction, I will send a new patch set next week.

thanks,

Quan
Alibaba Cloud

Thomas Gleixner

2017-Nov-17 11:36 UTC

[PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path

On Fri, 17 Nov 2017, Quan Xu wrote:
> On 2017-11-16 17:53, Thomas Gleixner wrote:
> > That's just plain wrong. We don't want to see any of this PARAVIRT crap in
> > anything outside the architecture/hypervisor interfacing code which really
> > needs it.
> > 
> > The problem can and must be solved at the generic level in the first place
> > to gather the data which can be used to make such decisions.
> > 
> > How that information is used might be either completely generic or requires
> > system specific variants. But as long as we don't have any information at
> > all we cannot discuss that.
> > 
> > Please sit down and write up which data needs to be considered to make
> > decisions about probabilistic polling. Then we need to compare and contrast
> > that with the data which is necessary to make power/idle state decisions.
> > 
> > I would be very surprised if this data would not overlap by at least 90%.
> > 
> 1. Which data needs to be considered to make decisions about probabilistic polling
> 
> I do need to write up which data needs to be considered to make
> decisions about probabilistic polling. Over the last several months
> I focused only on the data _from idle to reschedule_, in order to bypass
> the idle loops. Unfortunately, that inevitably made me touch the
> scheduler/idle/nohz code.
> 
> Following tglx's suggestion, the data needed to make power/idle
> state decisions is the last idle state's residency time. IIUC this is
> the duration from idle entry to wakeup, where the wakeup may be caused by
> a reschedule IRQ or by some other IRQ.

That's part of the picture, but not complete.

> I also measured that reschedule IRQs overlap with these wakeups by more
> than 90% (by tracing the need_resched status after cpuidle_idle_call) when
> running ctxsw/netperf for one minute.
> 
> Given that overlap, I think I can use the last idle state's residency time
> as the input for probabilistic polling decisions, as @dev->last_residency does.
> That data is much easier to get.

That's only true for your particular use case.

> 
> 2. Do an HV-specific idle driver (function)
> 
> So far, power management is not exposed to the guest. Idle is simple for a
> KVM guest: it calls "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call()).
> Thanks to the Xen guys, who implemented the paravirt framework, I can
> implement it as easily as the following:
> 
>     --- a/arch/x86/kernel/kvm.c

Your email client is using a very strange formatting.

>     +++ b/arch/x86/kernel/kvm.c
>     @@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
>             update_intr_gate(X86_TRAP_PF, async_page_fault);
>      }
> 
>     +static __cpuidle void kvm_safe_halt(void)
>     +{
>     +        /* 1. POLL, if need_resched() --> return */
>     +
>     +        asm volatile("sti; hlt": : :"memory"); /* 2. halt */
>     +
>     +        /* 3. get the last idle state's residency time */
>     +
>     +        /* 4. update poll duration based on last idle state's residency time */
>     +}
>     +
>      void __init kvm_guest_init(void)
>      {
>             int i;
>     @@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
>             if (kvmclock_vsyscall)
>                     kvm_setup_vsyscall_timeinfo();
> 
>     +       pv_irq_ops.safe_halt = kvm_safe_halt;
>     +
>      #ifdef CONFIG_SMP
> 
> 
> With this, I don't need to introduce a new pvops, and I never have to modify
> the scheduler/idle/nohz code again.
> I can also confine all of the code to arch/x86/kernel/kvm.c.
> 
> If this is the right direction, I will send a new patch set next week.

This is definitely better than what you proposed so far and implementing it
as a proof of concept seems to be worthwhile.

But I doubt that this is the final solution. It's not generic and not
necessarily suitable for all use case scenarios.

Thanks,

	tglx

Quan Xu

2017-Nov-17 12:21 UTC

[PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path

On 2017-11-17 19:36, Thomas Gleixner wrote:
> On Fri, 17 Nov 2017, Quan Xu wrote:
>> On 2017-11-16 17:53, Thomas Gleixner wrote:
>>> That's just plain wrong. We don't want to see any of this PARAVIRT crap in
>>> anything outside the architecture/hypervisor interfacing code which really
>>> needs it.
>>>
>>> The problem can and must be solved at the generic level in the first place
>>> to gather the data which can be used to make such decisions.
>>>
>>> How that information is used might be either completely generic or requires
>>> system specific variants. But as long as we don't have any information at
>>> all we cannot discuss that.
>>>
>>> Please sit down and write up which data needs to be considered to make
>>> decisions about probabilistic polling. Then we need to compare and contrast
>>> that with the data which is necessary to make power/idle state decisions.
>>>
>>> I would be very surprised if this data would not overlap by at least 90%.
>>>
>> 1. Which data needs to be considered to make decisions about probabilistic polling
>>
>> I do need to write up which data needs to be considered to make
>> decisions about probabilistic polling. Over the last several months
>> I focused only on the data _from idle to reschedule_, in order to bypass
>> the idle loops. Unfortunately, that inevitably made me touch the
>> scheduler/idle/nohz code.
>>
>> Following tglx's suggestion, the data needed to make power/idle
>> state decisions is the last idle state's residency time. IIUC this is
>> the duration from idle entry to wakeup, where the wakeup may be caused by
>> a reschedule IRQ or by some other IRQ.
> That's part of the picture, but not complete.

tglx, could you share more? I am very curious about it.

>> I also measured that reschedule IRQs overlap with these wakeups by more
>> than 90% (by tracing the need_resched status after cpuidle_idle_call) when
>> running ctxsw/netperf for one minute.
>>
>> Given that overlap, I think I can use the last idle state's residency time
>> as the input for probabilistic polling decisions, as @dev->last_residency does.
>> That data is much easier to get.
> That's only true for your particular use case.
>
>> 2. Do an HV-specific idle driver (function)
>>
>> So far, power management is not exposed to the guest. Idle is simple for a
>> KVM guest: it calls "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call()).
>> Thanks to the Xen guys, who implemented the paravirt framework, I can
>> implement it as easily as the following:
>>
>>     --- a/arch/x86/kernel/kvm.c
> Your email client is using a very strange formatting.

My bad, I inserted spaces to highlight the code.

> This is definitely better than what you proposed so far and implementing it
> as a proof of concept seems to be worthwhile.
>
> But I doubt that this is the final solution. It's not generic and not
> necessarily suitable for all use case scenarios.

Yes, I am exhausted :):)


Could you tell me what the gap is to being generic and suitable for
all use case scenarios? Is it the lack of irq/idle predictors?

I really want to upstream this for all public cloud users/providers.

Since the KVM host has a similar mechanism, would it be possible to upstream
this under the following conditions?
    1) Add a QEMU configuration option to enable or disable it, disabled by default.
    2) Add some "TODO" comments near the code.
    3) ...


Anyway, thanks for your help.

Quan
Alibaba Cloud
