On Tue, 2007-03-13 at 09:31 -0700, Jeremy Fitzhardinge wrote:
> The current Linux scheduler makes one big assumption: that 1ms of CPU
> time is the same as any other 1ms of CPU time, and that therefore a
> process makes the same amount of progress regardless of which
> particular ms of time it gets.
>
> This assumption is wrong now, and will become more wrong as
> virtualization gets more widely used.
>
> It's wrong now, because it fails to take into account several kinds
> of missing time:
>
>    1. interrupts - time spent in an ISR is accounted to the current
>       process, even though it gets no direct benefit
>    2. SMM - time is completely lost from the kernel
>    3. slow CPUs - 1ms of 600MHz CPU is less useful than 1ms of 2.4GHz CPU

[snip]

> So how to deal with this?  Basically we need a clock which measures
> "CPU work units", and have the scheduler use this clock.
>
> A "CPU work unit" clock has these properties:
>
>    * inherently per-CPU (from the kernel's perspective, so it would be
>      per-VCPU in a virtual machine)
>    * monotonic - you can't do negative work
>    * measured in "work units"

[snip]

> So, how to implement this?
>
> One quick hack would be to just make a new clocksource entrypoint,
> which returns work units rather than real-time cycles.  That would be
> fairly simple to implement, but it doesn't really take the per-cpu
> nature of the clock into account (since it's possible that different
> cpus on the same machine might need their own methods).
>
> Perhaps a better fit would be an entity which is equivalent to a
> clocksource, but registered per-cpu like (some) clockevents.
>
> I don't have a particular preference, but I wonder what the clock
> gurus think.

My gut reaction would be to avoid using clocksources for now.  While
there is some thought going into how to expand clocksources for other
uses (Daniel is working on this, for example), the design for
clocksources has been very focused on its utility for timekeeping, so
I'm hesitant to complicate the clocksources in order to multiplex
functionality until what is really needed is well understood.

I suspect the best approach would be to see how the sched_clock
interface can be reworked/used for what you want, as its design goals
map closest to the work-unit properties you list above.  Then we can
look to see how clocksources can be best used to implement the
sched_clock interface.

-john
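As a rough illustration of the direction suggested above - a per-CPU
"work clock" layered on top of the existing sched_clock() interface -
something along these lines might be sketched.  This is not a real
patch from the thread: the struct, the sched_work_clock() name, and the
code that would keep work_scale_fp up to date (from cpufreq or a
hypervisor) are all hypothetical, and locking/preemption handling is
omitted.

#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/types.h>

/*
 * Hypothetical per-CPU "work clock" built on sched_clock().  Each call
 * converts the real nanoseconds elapsed since the previous call into
 * work units via a per-CPU scale factor (16.16 fixed point), so the
 * factor can be changed whenever the CPU speeds up, slows down, or is
 * descheduled by a hypervisor.
 */
struct work_clock {
	u64 last_ns;	/* sched_clock() value at the last update */
	u64 work_ns;	/* accumulated work, in fastest-CPU ns    */
	u32 scale_fp;	/* work per real ns, 1.0 == 1 << 16       */
};

static DEFINE_PER_CPU(struct work_clock, work_clock);

u64 sched_work_clock(void)
{
	struct work_clock *wc = &__get_cpu_var(work_clock);
	u64 now = sched_clock();
	u64 delta = now - wc->last_ns;

	wc->last_ns  = now;
	wc->work_ns += (delta * wc->scale_fp) >> 16;

	return wc->work_ns;
}

Accumulating deltas rather than scaling the raw sched_clock() value is
what lets the scale factor vary over time without distorting work that
was already credited.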
The current Linux scheduler makes one big assumption: that 1ms of CPU
time is the same as any other 1ms of CPU time, and that therefore a
process makes the same amount of progress regardless of which particular
ms of time it gets.

This assumption is wrong now, and will become more wrong as
virtualization gets more widely used.

It's wrong now, because it fails to take into account several kinds of
missing time:

   1. interrupts - time spent in an ISR is accounted to the current
      process, even though it gets no direct benefit
   2. SMM - time is completely lost from the kernel
   3. slow CPUs - 1ms of 600MHz CPU is less useful than 1ms of 2.4GHz CPU

The first two - time lost to interrupts - are a well known problem, and
are generally considered to be a non-issue.  If you're losing a
significant amount of time to interrupts, you probably have bigger
problems.  (Or maybe not?)

The third is not something I've seen discussed before, but it seems like
it could be a significant problem today.  Certainly, I've noticed it
myself: an interactive program decides to do something CPU-intensive
(like start an animation), and it chugs until the conservative governor
brings the CPU up to speed.  Certainly some of this is because it's just
plain CPU-starved, but I think another factor is that it gets penalized
for running on a slow CPU: 1ms is not 1ms.  And for power reasons you
want to encourage processes to run on slow CPUs rather than penalize
them.

Virtualization just exacerbates this.  If you have a busy machine
running multiple virtual CPUs, then each VCPU may only get a small
proportion of the total amount of available CPU time.  If the kernel's
scheduler asserts that "you were just scheduled for 1ms, therefore you
made 1ms of progress", then many timeslices will effectively end up
being 1ms of 0MHz CPU - because the VCPU wasn't scheduled and the real
CPU was doing something else.

So how to deal with this?  Basically we need a clock which measures
"CPU work units", and have the scheduler use this clock.

A "CPU work unit" clock has these properties:

   * inherently per-CPU (from the kernel's perspective, so it would be
     per-VCPU in a virtual machine)
   * monotonic - you can't do negative work
   * measured in "work units"

A "work unit" is probably most simply expressed in cycles - you assume a
cycle of CPU time is equivalent in terms of work done to any other
cycle.  This means that 1 cycle at 600MHz is equivalent to 1 cycle at
2.4GHz - but of course the 2.4GHz processor gets 4 times as many in any
real time interval.  (This is the instance where the worst kind of tsc -
varying speed which stops on idle - is actually exactly what you want.)

You could also measure "work units" in terms of normalized time units:
if the fastest CPU on the machine is 2.4GHz, then 1ms of real time
counts as 1ms of work units on that CPU, but only as 250us of work units
on the 600MHz CPU.

It doesn't really matter what the unit is, so long as it is used
consistently to measure how much progress all processes made.

So, how to implement this?

One quick hack would be to just make a new clocksource entrypoint, which
returns work units rather than real-time cycles.  That would be fairly
simple to implement, but it doesn't really take the per-cpu nature of
the clock into account (since it's possible that different cpus on the
same machine might need their own methods).

Perhaps a better fit would be an entity which is equivalent to a
clocksource, but registered per-cpu like (some) clockevents.

I don't have a particular preference, but I wonder what the clock gurus
think.

    J
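To make the normalized-time variant above concrete, here is a small
userspace sketch (not from the thread; the function name is made up and
the numbers are just the 600MHz/2.4GHz example) that treats one work
unit as one nanosecond of the fastest CPU's time:

#include <stdint.h>
#include <stdio.h>

/* 1 "work unit" == 1 ns on the fastest CPU in the system (max_khz). */
static uint64_t cycles_to_work_ns(uint64_t cycles, uint32_t max_khz)
{
	/* cycles are treated as equivalent regardless of clock speed */
	return cycles * 1000000ULL / max_khz;
}

int main(void)
{
	uint32_t max_khz = 2400000;	/* fastest CPU: 2.4GHz */

	/* 1ms of real time: 600,000 cycles at 600MHz, 2,400,000 at 2.4GHz */
	uint64_t slow = cycles_to_work_ns(600000, max_khz);
	uint64_t fast = cycles_to_work_ns(2400000, max_khz);

	printf("1ms at 600MHz  = %lluus of work\n",
	       (unsigned long long)slow / 1000);	/* 250us  */
	printf("1ms at 2.4GHz  = %lluus of work\n",
	       (unsigned long long)fast / 1000);	/* 1000us */
	return 0;
}

The same arithmetic recovers the figures in the example: 1ms on the
600MHz CPU is 600,000 cycles, which normalizes to 250us of work against
the 2.4GHz machine.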
On Wednesday 14 March 2007 03:31, Jeremy Fitzhardinge wrote:
> The current Linux scheduler makes one big assumption: that 1ms of CPU
> time is the same as any other 1ms of CPU time, and that therefore a
> process makes the same amount of progress regardless of which
> particular ms of time it gets.

[snip]

> It doesn't really matter what the unit is, so long as it is used
> consistently to measure how much progress all processes made.

I think you're looking for a complex solution to a problem that doesn't
exist.  The job of the process scheduler is to meter out the available
cpu resources.  It cannot make up cycles for a slow cpu or one that is
throttled.  If the problem is happening due to throttling, it should be
fixed by altering the throttle.
The example you describe with the conservative governor is as easy to
fix as changing to the ondemand governor.  Differential-power cpus on an
SMP machine should be managed by SMP balancing choices based on power
groups.  It would be fine to implement some other accounting of this
definition of time for other purposes, but not for process scheduler
decisions per se.

Sorry to chime in late.  My physical condition prevents me spending any
extended period of time at the computer, so I've tried to be succinct
with my comments and may not be able to reply again.

-- 
-ck
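For reference, the governor change mentioned above is just a write to
the standard cpufreq sysfs attribute.  A minimal sketch (assuming the
ondemand governor is built in and that cpu0 uses the usual sysfs path;
run as root, and repeat per CPU as needed):

#include <stdio.h>

int main(void)
{
	/* standard cpufreq sysfs path for cpu0 */
	const char *path =
		"/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}

	/* switch from "conservative" (or whatever is active) to "ondemand" */
	fputs("ondemand\n", f);
	fclose(f);
	return 0;
}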