On Tue, 2007-03-13 at 09:31 -0700, Jeremy Fitzhardinge wrote:
> The current Linux scheduler makes one big assumption: that 1ms of CPU
> time is the same as any other 1ms of CPU time, and that therefore a
> process makes the same amount of progress regardless of which
> particular ms of time it gets.
>
> This assumption is wrong now, and will become more wrong as
> virtualization gets more widely used.
>
> It's wrong now, because it fails to take into account several kinds
> of missing time:
>
>    1. interrupts - time spent in an ISR is accounted to the current
>       process, even though it gets no direct benefit
>    2. SMM - time is completely lost from the kernel
>    3. slow CPUs - 1ms of 600MHz CPU is less useful than 1ms of 2.4GHz CPU

[snip]

> So how to deal with this?  Basically we need a clock which measures
> "CPU work units", and have the scheduler use this clock.
>
> A "CPU work unit" clock has these properties:
>
>    * inherently per-CPU (from the kernel's perspective, so it would be
>      per-VCPU in a virtual machine)
>    * monotonic - you can't do negative work
>    * measured in "work units"

[snip]

> So, how to implement this?
>
> One quick hack would be to just make a new clocksource entrypoint,
> which returns work units rather than real-time cycles.  That would be
> fairly simple to implement, but it doesn't really take the per-cpu
> nature of the clock into account (since it's possible that different
> cpus on the same machine might need their own methods).
>
> Perhaps a better fit would be an entity which is equivalent to a
> clocksource, but registered per-cpu like (some) clockevents.
>
> I don't have a particular preference, but I wonder what the clock
> gurus think.

My gut reaction would be to avoid using clocksources for now.  While
there is some thought going into how to expand clocksources for other
uses (Daniel is working on this, for example), the design for
clocksources has been very focused on its utility for timekeeping, so
I'm hesitant to complicate the clocksources in order to multiplex
functionality until what is really needed is well understood.

I suspect the best approach would be to see how the sched_clock
interface can be reworked/used for what you want, as its design goals
map closest to the work-unit properties you list above.  Then we can
look to see how clocksources can be best used to implement the
sched_clock interface.

-john
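As a rough illustration of the direction suggested above - a per-CPU
"work clock" layered on top of the existing sched_clock() interface -
something along these lines might be sketched.  This is not a real
patch from the thread: the struct, the sched_work_clock() name, and the
code that would keep work_scale_fp up to date (from cpufreq or a
hypervisor) are all hypothetical, and locking/preemption handling is
omitted.

#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/types.h>

/*
 * Hypothetical per-CPU "work clock" built on sched_clock().  Each call
 * converts the real nanoseconds elapsed since the previous call into
 * work units via a per-CPU scale factor (16.16 fixed point), so the
 * factor can be changed whenever the CPU speeds up, slows down, or is
 * descheduled by a hypervisor.
 */
struct work_clock {
	u64 last_ns;	/* sched_clock() value at the last update */
	u64 work_ns;	/* accumulated work, in fastest-CPU ns    */
	u32 scale_fp;	/* work per real ns, 1.0 == 1 << 16       */
};

static DEFINE_PER_CPU(struct work_clock, work_clock);

u64 sched_work_clock(void)
{
	struct work_clock *wc = &__get_cpu_var(work_clock);
	u64 now = sched_clock();
	u64 delta = now - wc->last_ns;

	wc->last_ns  = now;
	wc->work_ns += (delta * wc->scale_fp) >> 16;

	return wc->work_ns;
}

Accumulating deltas rather than scaling the raw sched_clock() value is
what lets the scale factor vary over time without distorting work that
was already credited.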
The current Linux scheduler makes one big assumption: that 1ms of CPU
time is the same as any other 1ms of CPU time, and that therefore a
process makes the same amount of progress regardless of which particular
ms of time it gets.

This assumption is wrong now, and will become more wrong as
virtualization gets more widely used.

It's wrong now, because it fails to take into account several kinds of
missing time:

   1. interrupts - time spent in an ISR is accounted to the current
      process, even though it gets no direct benefit
   2. SMM - time is completely lost from the kernel
   3. slow CPUs - 1ms of 600MHz CPU is less useful than 1ms of 2.4GHz CPU

The first two - time lost to interrupts - are a well known problem, and
are generally considered to be a non-issue.  If you're losing a
significant amount of time to interrupts, you probably have bigger
problems.  (Or maybe not?)

The third is not something I've seen discussed before, but it seems like
it could be a significant problem today.  Certainly, I've noticed it
myself: an interactive program decides to do something CPU-intensive
(like start an animation), and it chugs until the conservative governor
brings the CPU up to speed.  Certainly some of this is because it's just
plain CPU-starved, but I think another factor is that it gets penalized
for running on a slow CPU: 1ms is not 1ms.  And for power reasons you
want to encourage processes to run on slow CPUs rather than penalize
them.

Virtualization just exacerbates this.  If you have a busy machine
running multiple virtual CPUs, then each VCPU may only get a small
proportion of the total amount of available CPU time.  If the kernel's
scheduler asserts that "you were just scheduled for 1ms, therefore you
made 1ms of progress", then many timeslices will effectively end up
being 1ms of 0MHz CPU - because the VCPU wasn't scheduled and the real
CPU was doing something else.

So how to deal with this?  Basically we need a clock which measures
"CPU work units", and have the scheduler use this clock.

A "CPU work unit" clock has these properties:

   * inherently per-CPU (from the kernel's perspective, so it would be
     per-VCPU in a virtual machine)
   * monotonic - you can't do negative work
   * measured in "work units"

A "work unit" is probably most simply expressed in cycles - you assume a
cycle of CPU time is equivalent in terms of work done to any other
cycle.  This means that 1 cycle at 600MHz is equivalent to 1 cycle at
2.4GHz - but of course the 2.4GHz processor gets 4 times as many in any
real time interval.  (This is the instance where the worst kind of tsc -
varying speed which stops on idle - is actually exactly what you want.)

You could also measure "work units" in terms of normalized time units:
if the fastest CPU on the machine is 2.4GHz, then 1ms of real time
counts as 1ms of work units on that CPU, but only as 250us of work units
on the 600MHz CPU.

It doesn't really matter what the unit is, so long as it is used
consistently to measure how much progress all processes made.

So, how to implement this?

One quick hack would be to just make a new clocksource entrypoint, which
returns work units rather than real-time cycles.  That would be fairly
simple to implement, but it doesn't really take the per-cpu nature of
the clock into account (since it's possible that different cpus on the
same machine might need their own methods).

Perhaps a better fit would be an entity which is equivalent to a
clocksource, but registered per-cpu like (some) clockevents.

I don't have a particular preference, but I wonder what the clock gurus
think.

    J
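To make the normalized-time variant above concrete, here is a small
userspace sketch (not from the thread; the function name is made up and
the numbers are just the 600MHz/2.4GHz example) that treats one work
unit as one nanosecond of the fastest CPU's time:

#include <stdint.h>
#include <stdio.h>

/* 1 "work unit" == 1 ns on the fastest CPU in the system (max_khz). */
static uint64_t cycles_to_work_ns(uint64_t cycles, uint32_t max_khz)
{
	/* cycles are treated as equivalent regardless of clock speed */
	return cycles * 1000000ULL / max_khz;
}

int main(void)
{
	uint32_t max_khz = 2400000;	/* fastest CPU: 2.4GHz */

	/* 1ms of real time: 600,000 cycles at 600MHz, 2,400,000 at 2.4GHz */
	uint64_t slow = cycles_to_work_ns(600000, max_khz);
	uint64_t fast = cycles_to_work_ns(2400000, max_khz);

	printf("1ms at 600MHz  = %lluus of work\n",
	       (unsigned long long)slow / 1000);	/* 250us  */
	printf("1ms at 2.4GHz  = %lluus of work\n",
	       (unsigned long long)fast / 1000);	/* 1000us */
	return 0;
}

The same arithmetic recovers the figures in the example: 1ms on the
600MHz CPU is 600,000 cycles, which normalizes to 250us of work against
the 2.4GHz machine.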
On Wednesday 14 March 2007 03:31, Jeremy Fitzhardinge wrote:
> The current Linux scheduler makes one big assumption: that 1ms of CPU
> time is the same as any other 1ms of CPU time, and that therefore a
> process makes the same amount of progress regardless of which
> particular ms of time it gets.

[snip]

> It doesn't really matter what the unit is, so long as it is used
> consistently to measure how much progress all processes made.

I think you're looking for a complex solution to a problem that doesn't
exist.  The job of the process scheduler is to meter out the available
cpu resources.  It cannot make up cycles for a slow cpu or one that is
throttled.  If the problem is happening due to throttling, it should be
fixed by altering the throttle.
The example you describe with the conservative governor is as easy to
fix as changing to the ondemand governor.  Differential-power cpus on an
SMP machine should be managed by SMP balancing choices based on power
groups.  It would be fine to implement some other accounting of this
definition of time for other purposes, but not for process scheduler
decisions per se.

Sorry to chime in late.  My physical condition prevents me spending any
extended period of time at the computer, so I've tried to be succinct
with my comments and may not be able to reply again.

-- 
-ck
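For reference, the governor change mentioned above is just a write to
the standard cpufreq sysfs attribute.  A minimal sketch (assuming the
ondemand governor is built in and that cpu0 uses the usual sysfs path;
run as root, and repeat per CPU as needed):

#include <stdio.h>

int main(void)
{
	/* standard cpufreq sysfs path for cpu0 */
	const char *path =
		"/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}

	/* switch from "conservative" (or whatever is active) to "ondemand" */
	fputs("ondemand\n", f);
	fclose(f);
	return 0;
}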